niques to analyze random matrices with additive structure, while the enhancements in
this paper cover a wider class of matrix-valued random elements. In particular, these ideas
lead to a bounded differences inequality that applies to random matrices constructed from 
weakly dependent random variables. The proofs require novel trace inequalities that may 
be of independent interest. 

AMS 2000 subject classifications: Primary 60B20, 60E15; secondary 60G09, 60F10. 
Keywords and phrases: Concentration inequalities, Stein's method, random matrix,
non-commutative, exchangeable pairs, coupling, bounded differences, Dobrushin dependence,
Ising model, Haar measure, trace inequality.

This paper is based on two independent manuscripts from late 2012 that both used kernel
couplings to establish matrix concentration inequalities. One manuscript is by Paulin; the other
is by Mackey and Tropp. The authors have combined this research into a unified presentation, 
with equal contributions from both groups. 

1. Introduction 

Matrix concentration inequalities provide probabilistic bounds on the spectral-norm deviation 
of a random matrix from its mean value. Over the last decade, a growing field of research 
has established that many scalar concentration results have direct analogs for matrices. For 
example, see [1, 16, 23]. This machinery has simplified the study of random matrices that 
arise in applications from statistics [8], machine learning [15], signal processing [2], numerical 
analysis [22], theoretical computer science [24], and combinatorics [17]. 

Most of the recent research on matrix concentration depends on a matrix extension of the 
Laplace transform method from elementary probability. In the matrix setting, it is a serious 
technical challenge to obtain bounds on the matrix analog of the moment generating function. 
The earlier works [1, 16] use the Golden-Thompson inequality to accomplish this task. A more 
powerful argument [23] invokes Lieb's Theorem [10, Thm. 6] to complete the estimates. 

Very recently, Mackey et al. [13] have shown that it is also possible to use Stein's method of 
exchangeable pairs to control the matrix moment generating function. This argument depends 
on a matrix version of Chatterjee's technique [5, 4] for establishing concentration inequalities
using exchangeable pairs. This approach has two chief advantages. First, it offers a straightforward
way to prove polynomial moment inequalities for matrices, which are not easy to obtain
using earlier techniques. Second, exchangeable pair arguments also apply to random matrices 
constructed from weakly dependent random variables. 

The work [13] focuses on sums of weakly dependent random matrices because its techniques 
are less effective for other examples. The goal of the current research is to adapt ideas from 
Chatterjee's thesis [4] to establish concentration inequalities for more general types of random 
matrices. In particular, we have obtained new versions of the matrix bounded difference
inequality (see [23, Cor. 7.5] or [13, Cor. 11.1]) that hold for a random matrix that is expressed
as a measurable function of weakly dependent random variables. These results appear as
Corollary 4.1 and Corollary 5.2.

1.1. A First Look at Exchangeable Pairs 

The method of exchangeable pairs depends on the idea that an exchangeable counterpart of a 
random variable encodes information about the symmetries in the distribution. Here is a simple 
but fundamental example of an exchangeable pair of random matrices:

$$X := \sum_{j=1}^n Y_j \quad\text{and}\quad X' := X + (\widetilde{Y}_J - Y_J),$$

where $\{Y_j\}$ is an independent family of random Hermitian matrices, $J$ is a random index chosen
uniformly from $\{1, \dots, n\}$, and $\widetilde{Y}_j$ is an independent copy of $Y_j$. Notice that



the variance of the independent sum X. When this random matrix is uniformly small in norm, 
we can prove that the sum X concentrates around its mean value. We refer to Theorem 3.1 or 
the result [13, Thm. 4.1] for a rigorous statement. 

1.2. Roadmap 

Section 1.3 continues with some notation and preliminary remarks. In Section 2, we describe 
the concept of a kernel Stein pair of random matrices, which stands at the center of our analysis.
In Section 3, we state abstract concentration inequalities for kernel Stein pairs. Afterward,
Sections 4 and 5 derive bounded difference inequalities for random matrices constructed from
independent and weakly dependent random variables. As an application, we consider the problem
of estimating the correlations in a two-dimensional Ising model in Section 6. We close with
some complementary material in Section 7. The proofs of the main results appear in three 
Appendices. 

1.3. Notation and Preliminaries 










First, we introduce the identity matrix I and the zero matrix 0. Their dimensions are determined 
by context. 
We write $\mathbb{M}^d$ for the algebra of $d \times d$ complex matrices. The symbol $\|\cdot\|$ always refers to the
usual operator norm on $\mathbb{M}^d$ induced by the $\ell_2^d$ vector norm. We also equip $\mathbb{M}^d$ with the trace
inner product $\langle B, C \rangle := \operatorname{tr}[B^* C]$ to form a Hilbert space.

Let $\mathbb{H}^d$ denote the subspace of $\mathbb{M}^d$ consisting of $d \times d$ Hermitian matrices. Given an interval
$I$ of the real line, we define $\mathbb{H}^d(I)$ to be the family of Hermitian matrices with eigenvalues
contained in $I$. We use curly inequalities, such as $\preccurlyeq$, for the positive semidefinite order on the
Hilbert space $\mathbb{H}^d$.

Let $f : I \to \mathbb{R}$ be a function on an interval $I$ of the real line. We can lift $f$ to form a standard
matrix function $f : \mathbb{H}^d(I) \to \mathbb{H}^d$. More precisely, for each matrix $A \in \mathbb{H}^d(I)$, we define the
standard matrix function via the rule

$$f(A) := \sum_{k=1}^d f(\lambda_k)\, u_k u_k^* \quad\text{where}\quad A = \sum_{k=1}^d \lambda_k\, u_k u_k^*$$

is an eigenvalue decomposition of the Hermitian matrix $A$. When we apply a familiar scalar
function to an Hermitian matrix, we are always referring to the associated standard operator
function. To denote general matrix-valued functions, we use bold uppercase letters, such as
$F$, $H$, $\Psi$.

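For $M \in \mathbb{M}^d$, we write $\operatorname{Re}(M) := \tfrac{1}{2}(M + M^*)$ for the Hermitian part of $M$. The following
semidefinite relation holds.

$$\operatorname{Re}(AB) = \frac{AB + BA}{2} \preccurlyeq \frac{A^2 + B^2}{2} \quad\text{for all } A, B \in \mathbb{H}^d. \tag{1.2}$$

This result follows when we expand the expression $(A - B)^2 \succcurlyeq 0$. As a consequence,

$$\Big( \frac{A + B}{2} \Big)^2 \preccurlyeq \frac{A^2 + B^2}{2} \quad\text{for all } A, B \in \mathbb{H}^d. \tag{1.3}$$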



In other words, the matrix square is operator convex. 
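
For readers who want the intermediate algebra behind (1.2) and (1.3), the following short derivation spells out the expansion. It uses only the assumption that $A$ and $B$ are Hermitian, so that $(A - B)^2$ is positive semidefinite.

```latex
\begin{align*}
0 \preccurlyeq (A - B)^2 &= A^2 - AB - BA + B^2
  \quad\Longrightarrow\quad AB + BA \preccurlyeq A^2 + B^2, \\
\Big( \frac{A + B}{2} \Big)^2 &= \frac{A^2 + B^2}{4} + \frac{AB + BA}{4}
  \preccurlyeq \frac{A^2 + B^2}{4} + \frac{A^2 + B^2}{4}
  = \frac{A^2 + B^2}{2}.
\end{align*}
```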

Finally, we need two additional families of matrix norms. For $p \in [1, \infty]$, the Schatten $p$-norm
is given by

$$\|B\|_{S_p} := (\operatorname{tr} |B|^p)^{1/p} \quad\text{for each } B \in \mathbb{M}^d,$$

where $|B| := (B^* B)^{1/2}$. For $p \ge 1$, we introduce the matrix norm induced by the $\ell_p$ vector
norm:

$$\|B\|_{p \to p} := \sup_{x \ne 0} \frac{\|Bx\|_p}{\|x\|_p} \quad\text{for each } B \in \mathbb{M}^d. \tag{1.4}$$

In particular, the matrix norm induced by the $\ell_1$ vector norm returns the maximum $\ell_1$ norm
of a column; the norm induced by $\ell_\infty$ returns the maximum $\ell_1$ norm of a row.
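
The two special cases of the induced norm can be computed directly from column and row sums. The sketch below is a minimal illustration in plain Python; the function names and the example matrix are hypothetical and are not part of the paper.

```python
def induced_l1_norm(matrix):
    """Norm induced by the l1 vector norm: the largest absolute column sum."""
    n_cols = len(matrix[0])
    return max(sum(abs(row[j]) for row in matrix) for j in range(n_cols))


def induced_linf_norm(matrix):
    """Norm induced by the l-infinity vector norm: the largest absolute row sum."""
    return max(sum(abs(entry) for entry in row) for row in matrix)


# Hypothetical 2 x 2 example, chosen only for illustration.
B = [[1.0, -2.0],
     [3.0, 4.0]]
norm_1 = induced_l1_norm(B)      # absolute column sums: 1 + 3 and 2 + 4
norm_inf = induced_linf_norm(B)  # absolute row sums: 1 + 2 and 3 + 4
```

Because `abs` also handles complex entries, the same functions apply to complex matrices stored as lists of rows.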



2. Exchangeable Pairs of Random Matrices 

The basic principle behind this paper is that we can exploit the symmetries of the distribution 
of a random matrix to obtain matrix concentration inequalities. One way to encode symmetries 
is to identify an exchangeable counterpart of the random matrix. This section outlines the 
main concepts from the method of exchangeable pairs, including an example of fundamental 
importance. Once we have an exchangeable pair, we can apply ideas of Chatterjee [4] to obtain 
concentration inequalities, which is the subject of Section 3. 

2.1. Kernel Stein Pairs

In this work, the primal concept is an exchangeable pair of random variables.

Definition 2.1 (Exchangeable Pair). Let $Z$ and $Z'$ be a pair of random variables taking values
in a Polish space $\mathcal{Z}$. We say that $(Z, Z')$ is an exchangeable pair when it has the same
distribution as the pair $(Z', Z)$.

In particular, $Z$ and $Z'$ have the same distribution, and $\mathbb{E} f(Z, Z') = \mathbb{E} f(Z', Z)$ for every
function $f$ where the expectations are finite.

We are interested in a special class of exchangeable pairs of random matrices. There must be 
an antisymmetric bivariate kernel that "reproduces" the matrices in the pair. 

Definition 2.2 (Kernel Stein Pair). Let $(Z, Z')$ be an exchangeable pair of random variables
taking values in a Polish space $\mathcal{Z}$, and let $\Psi : \mathcal{Z} \to \mathbb{H}^d$ be a measurable function. Define the
random Hermitian matrices

$$X := \Psi(Z) \quad\text{and}\quad X' := \Psi(Z').$$

We say that $(X, X')$ is a kernel Stein pair if there is a bivariate function $K : \mathcal{Z}^2 \to \mathbb{H}^d$ for
which

$$K(Z, Z') = -K(Z', Z) \quad\text{and}\quad \mathbb{E}[K(Z, Z') \mid Z] = X \quad\text{almost surely.} \tag{2.1}$$

When discussing a kernel Stein pair $(X, X')$, we always assume that $\mathbb{E}\|X\|^2 < \infty$. We sometimes
write $K$-Stein pair to emphasize the specific kernel $K$.

It turns out that most exchangeable pairs of random matrices admit a kernel $K$ that satisfies (2.1).
We describe the construction in Section 2.2.

Kernel Stein Pairs versus Matrix Stein Pairs. The analysis in the article [13] is based 
on an important subclass of kernel Stein pairs termed matrix Stein pairs. A matrix Stein pair 
$(X, X')$ derived from an auxiliary exchangeable pair $(Z, Z')$ satisfies the stronger condition

$$\mathbb{E}[X - X' \mid Z] = \alpha X \quad\text{for some } \alpha > 0. \tag{2.2}$$

That is, a matrix Stein pair is a kernel Stein pair with $K(Z, Z') = \alpha^{-1}(X - X')$. Although
the paper [13] describes several fundamental classes of matrix Stein pairs, most exchangeable 
pairs of random matrices do not satisfy the condition (2.2). Kernel Stein pairs are much more 
common, so they are commensurately more useful. 
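
As a concrete check of condition (2.2), consider the independent sum from Section 1.1: $X$ is the sum of independent Hermitian matrices $Y_1, \dots, Y_n$, and $X'$ is obtained by replacing the summand with uniformly random index $J$ by an independent copy $\widetilde{Y}_J$. Under the additional assumption, made here only for illustration, that each summand is centered ($\mathbb{E}\, Y_j = 0$), conditioning on $Z = (Y_1, \dots, Y_n)$ and averaging over $J$ and the independent copies gives the following.

```latex
\begin{align*}
\mathbb{E}[X - X' \mid Z]
  = \mathbb{E}[Y_J - \widetilde{Y}_J \mid Z]
  = \frac{1}{n} \sum_{j=1}^n \big( Y_j - \mathbb{E}\, Y_j \big)
  = \frac{1}{n}\, X.
\end{align*}
```

In this example the pair satisfies (2.2) with $\alpha = 1/n$, and the associated kernel is $K(Z, Z') = n (X - X')$.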

2.2. Kernel Couplings 

Given an exchangeable pair of random matrices, we can ask whether it is possible to equip 
the pair with a kernel that satisfies (2.1). In fact, there is a very general construction that 
works whenever the exchangeable pair is suitably ergodic. This method depends on an idea of 
Chatterjee [4, Sec. 4.1] that ultimately relies on an observation of Stein [21]. 

Stein noticed that any exchangeable pair $(Z, Z')$ of $\mathcal{Z}$-valued random variables defines a
reversible Markov chain with a symmetric transition kernel $P$ given by

$$Pf(z) := \mathbb{E}[f(Z') \mid Z = z]$$

for each integrable function $f : \mathcal{Z} \to \mathbb{R}$. In other words, for any initial value $Z_{(0)} \in \mathcal{Z}$, we can
construct a Markov chain

$$Z_{(0)} \to Z_{(1)} \to Z_{(2)} \to Z_{(3)} \to \cdots$$

where $\mathbb{E}[f(Z_{(i+1)}) \mid Z_{(i)}] = Pf(Z_{(i)})$ for each integrable function $f$. This requirement suffices to
determine the distribution of each $Z_{(i+1)}$.

When the chain $(Z_{(i)})_{i \ge 0}$ is ergodic enough, we can explicitly construct a kernel that satisfies (2.1)
for any exchangeable pair of random matrices constructed from the auxiliary exchangeable
pair $(Z, Z')$. To explain this idea, we begin with a definition.

Definition 2.3 (Kernel Coupling). Let $(Z, Z') \in \mathcal{Z}^2$ be an exchangeable pair. Let $(Z_{(i)})_{i \ge 0}$
and $(Z'_{(i)})_{i \ge 0}$ be two Markov chains with arbitrary initial values, each evolving according to the
transition kernel $P$ induced by $(Z, Z')$. We call $(Z_{(i)}, Z'_{(i)})_{i \ge 0}$ a kernel coupling for $(Z, Z')$ if

$$Z_{(i)} \perp\!\!\!\perp Z'_{(0)} \mid Z_{(0)} \quad\text{and}\quad Z'_{(i)} \perp\!\!\!\perp Z_{(0)} \mid Z'_{(0)} \quad\text{for all } i. \tag{2.3}$$

The expression $U \perp\!\!\!\perp V \mid W$ means that $U$ and $V$ are independent conditional on $W$.

The key lemma, essentially due to Chatterjee [4, Sec. 4.1], allows us to construct a kernel 
Stein pair by way of a kernel coupling. 

Lemma 2.4. Let $(Z_{(i)}, Z'_{(i)})_{i \ge 0}$ be a kernel coupling for an exchangeable pair $(Z, Z') \in \mathcal{Z}^2$.
Let $\Psi : \mathcal{Z} \to \mathbb{H}^d$ be a measurable function with $\mathbb{E}\, \Psi(Z) = 0$. Suppose that there is a positive
constant $L$ for which

$$\sum_{i=0}^\infty \big\| \mathbb{E}[\Psi(Z_{(i)}) - \Psi(Z'_{(i)}) \mid Z_{(0)} = z, Z'_{(0)} = z'] \big\| \le L \quad\text{for all } z, z' \in \mathcal{Z}. \tag{2.4}$$

Then $(\Psi(Z), \Psi(Z'))$ is a kernel Stein pair with kernel

$$K(Z, Z') := \sum_{i=0}^\infty \mathbb{E}[\Psi(Z_{(i)}) - \Psi(Z'_{(i)}) \mid Z_{(0)} = Z, Z'_{(0)} = Z']. \tag{2.5}$$

The proof of this result is identical to that of [4, Lem. 4.2], which establishes the same
formula (2.5) in the scalar setting. Lemma 2.4 indicates that the kernel $K$ associated with an
exchangeable pair $(Z, Z')$ and a map $\Psi$ tends to be small when the two Markov chains in the
kernel coupling have a small coupling time.

2.3. Conditional Variance 

To each kernel Stein pair (X, X'), we may associate two random matrices called the conditional 
variance and kernel conditional variance of X. Ultimately, we show that X is concentrated 
around the zero matrix whenever the conditional variance and the kernel conditional variance 
are both small. 

Definition 2.5 (Conditional Variance). Suppose that $(X, X')$ is a $K$-Stein pair, constructed
from an auxiliary exchangeable pair $(Z, Z')$. The conditional variance is the random matrix

$$V_X := V_X(Z) := \tfrac{1}{2}\, \mathbb{E}[(X - X')^2 \mid Z], \tag{2.6}$$

and the kernel conditional variance is the random matrix

$$V^K := V^K(Z) := \tfrac{1}{2}\, \mathbb{E}[K(Z, Z')^2 \mid Z]. \tag{2.7}$$

The following lemma provides a convenient way to control the conditional variance and the
kernel conditional variance when the kernel is obtained from a kernel coupling as in Lemma 2.4.

Lemma 2.6. Let $(Z_{(i)}, Z'_{(i)})_{i \ge 0}$ be a kernel coupling for an exchangeable pair $(Z, Z') \in \mathcal{Z}^2$, and
let $\Psi : \mathcal{Z} \to \mathbb{H}^d$ be a measurable map. Suppose that $(X, X') = (\Psi(Z), \Psi(Z'))$ is a kernel Stein
pair where the kernel $K$ is constructed via (2.5). For each $i = 0, 1, 2, \dots$, assume that

$$\mathbb{E}\Big[ \big( \mathbb{E}[\Psi(Z_{(i)}) - \Psi(Z'_{(i)}) \mid Z, Z'] \big)^2 \,\Big|\, Z \Big] \preccurlyeq s_i^2\, \Gamma(Z) \quad\text{almost surely,} \tag{2.8}$$

where $\Gamma : \mathcal{Z} \to \mathbb{H}^d$ is a measurable map and $(s_i)_{i \ge 0}$ is a deterministic sequence of nonnegative
numbers. Then the conditional variance (2.6) satisfies

$$V_X \preccurlyeq \tfrac{1}{2}\, s_0^2\, \Gamma(Z) \quad\text{almost surely,}$$

and the kernel conditional variance (2.7) satisfies

$$V^K \preccurlyeq \tfrac{1}{2} \Big( \sum_{i=0}^\infty s_i \Big)^2 \Gamma(Z) \quad\text{almost surely.}$$

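Proof. Using a continuity argument, we may assume that $s_i > 0$ for each integer $i \ge 0$.
For each $i$, define $Y_i := \mathbb{E}[\Psi(Z_{(i)}) - \Psi(Z'_{(i)}) \mid Z, Z']$. By the kernel coupling construction (2.5),
we have

$$\begin{aligned}
V^K &= \frac{1}{2}\, \mathbb{E}\Big[ \Big( \sum_{i=0}^\infty Y_i \Big)^2 \,\Big|\, Z \Big]
  = \frac{1}{2} \sum_{i=0}^\infty \sum_{j=0}^\infty \mathbb{E}[\operatorname{Re}(Y_i Y_j) \mid Z] \\
&\preccurlyeq \frac{1}{2} \sum_{i=0}^\infty \sum_{j=0}^\infty \frac{1}{2}\, \mathbb{E}\Big[ \frac{s_j}{s_i}\, Y_i^2 + \frac{s_i}{s_j}\, Y_j^2 \,\Big|\, Z \Big] \\
&\preccurlyeq \frac{1}{2} \sum_{i=0}^\infty \sum_{j=0}^\infty \frac{1}{2} \Big( \frac{s_j}{s_i}\, s_i^2 + \frac{s_i}{s_j}\, s_j^2 \Big) \Gamma(Z) \\
&= \frac{1}{2} \sum_{i=0}^\infty \sum_{j=0}^\infty s_i s_j\, \Gamma(Z)
  = \frac{1}{2} \Big( \sum_{i=0}^\infty s_i \Big)^2 \Gamma(Z),
\end{aligned}$$

where the first semidefinite inequality follows from (1.2) and the second inequality depends on
the hypothesis (2.8). Similarly,

$$V_X = \tfrac{1}{2}\, \mathbb{E}[Y_0^2 \mid Z] \preccurlyeq \tfrac{1}{2}\, s_0^2\, \Gamma(Z).$$

This observation completes the proof. □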

2.4. Example: Matrix Functions of Independent Variables

To illustrate the definitions in this section, we describe a simple but important example of a
kernel Stein pair. Suppose that $Z := (Z_1, \dots, Z_n)$ is a vector of independent random variables
taking values in a Polish space $\mathcal{Z}$. Let $H : \mathcal{Z} \to \mathbb{H}^d$ be a measurable function, and let $(A_j)_{j \ge 1}$
be a sequence of deterministic Hermitian matrices satisfying

$$\big( H(z_1, \dots, z_n) - H(z_1, \dots, z'_j, \dots, z_n) \big)^2 \preccurlyeq A_j^2 \tag{2.9}$$

where $z_j, z'_j$ range over the possible values of $Z_j$ for each $j$. We aim to analyze the random
matrix

$$X := H(Z) - \mathbb{E}\, H(Z). \tag{2.10}$$

We encounter matrices of this form in a variety of applications. For instance, concentration 
inequalities for the norm of X have immediate implications for the generalization properties of 
algorithms for multiclass classification [11, 15]. 

In this section, we explain how to construct a kernel exchangeable pair for studying the 
random matrix (2.10), and we compute the conditional variance and kernel conditional variance. 
Later, in Section 4, we use these calculations to establish a matrix bounded difference inequality 
that improves on [23, Cor 7.5]. 

To begin, we form an exchangeable counterpart for $Z$:

$$Z' := (Z_1, \dots, Z_{J-1}, \widetilde{Z}_J, Z_{J+1}, \dots, Z_n)$$

where $\widetilde{Z} := (\widetilde{Z}_1, \dots, \widetilde{Z}_n)$ is an independent copy of $Z$. We draw the coordinate $J$ uniformly at
random from $\{1, \dots, n\}$, independent from everything else. Then the random matrix

$$X' := H(Z') - \mathbb{E}\, H(Z)$$

is an exchangeable counterpart for the matrix $X$.

To verify that $(X, X')$ is a kernel Stein pair for a suitable kernel $K$, we establish an explicit
kernel coupling $(Z_{(i)}, Z'_{(i)})_{i \ge 0}$. For each $i \ge 1$, define $\widetilde{Z}_{(i)}$ to be an independent copy of $Z$. We
generate the pair $(Z_{(i)}, Z'_{(i)})$ from the previous pair $(Z_{(i-1)}, Z'_{(i-1)})$ by selecting an independent
random index $J_i$ uniformly from $\{1, \dots, n\}$ and replacing the $J_i$-th coordinates of both $Z_{(i-1)}$
and $Z'_{(i-1)}$ with the $J_i$-th coordinate of $\widetilde{Z}_{(i)}$. By construction, the two marginal chains $(Z_{(i)})_{i \ge 0}$
and $(Z'_{(i)})_{i \ge 0}$ evolve according to the transition kernel induced by $(Z, Z')$, and they satisfy
the kernel coupling property (2.3). The analysis of the coupon collector's problem [9, Sec. 2.2]
shows that the expected coupling time for this pair of Markov chains is bounded by $n(1 + \log n)$.
Therefore, Lemma 2.4 implies that $(X, X')$ is a kernel Stein pair with

$$K(Z, Z') := \sum_{i=0}^\infty \mathbb{E}[H(Z_{(i)}) - H(Z'_{(i)}) \mid Z_{(0)} = Z, Z'_{(0)} = Z'].$$

Since the two Markov chains couple rapidly, we expect that the kernel is small. 
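
The coupling described above can be simulated in a few lines. The sketch below is an illustration under stated assumptions, not the construction used in the proofs: the coordinates are taken to be uniform random numbers, the chains are started from hypothetical worst-case values that disagree everywhere, and the helper names are ours. It also evaluates the exact coupon collector mean $n H_n$, which can be compared with the bound $n(1 + \log n)$ quoted in the text.

```python
import math
import random


def expected_refresh_time(n):
    """Exact expected number of uniform draws needed to hit all n coordinates
    (the coupon collector mean n * H_n)."""
    return n * sum(1.0 / k for k in range(1, n + 1))


def run_coupled_chains(z, z_prime, sample_coordinate, rng):
    """Run the coupled chains until they agree; return the number of steps.

    At each step one coordinate index is drawn uniformly and the SAME fresh
    value replaces that coordinate in both chains.
    """
    z, z_prime = list(z), list(z_prime)
    steps = 0
    while z != z_prime:
        j = rng.randrange(len(z))
        fresh = sample_coordinate(rng)
        z[j] = fresh
        z_prime[j] = fresh
        steps += 1
    return steps


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 8
start = [0] * n
start_prime = [1] * n   # hypothetical worst case: the chains disagree everywhere
coupling_steps = run_coupled_chains(start, start_prime, lambda r: r.random(), rng)
bound = n * (1 + math.log(n))
```

In the setting of this section the two initial states differ only in the coordinate $J$, so the chains typically meet much sooner than in this worst-case start.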

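To bound the size of the kernel, we use Lemma 2.6. For each integer $i \ge 0$, define the event
$\mathcal{E}_i := \{ J \notin \{J_1, \dots, J_i\} \}$. Off of the event $\mathcal{E}_i$, we have $H(Z_{(i)}) = H(Z'_{(i)})$; on the event $\mathcal{E}_i$, the
random vectors $Z_{(i)}$ and $Z'_{(i)}$ can differ only in the $J$-th coordinate. Therefore,

$$\begin{aligned}
\mathbb{E}\Big[ \big( \mathbb{E}[H(Z_{(i)}) - H(Z'_{(i)}) \mid Z, Z'] \big)^2 \,\Big|\, Z \Big]
&= \mathbb{E}\Big[ \big( \mathbb{P}\{\mathcal{E}_i\} \cdot \mathbb{E}[H(Z_{(i)}) - H(Z'_{(i)}) \mid Z, Z', \mathcal{E}_i] \big)^2 \,\Big|\, Z \Big] \\
&\preccurlyeq (1 - 1/n)^{2i} \cdot \mathbb{E}\big[ (H(Z_{(i)}) - H(Z'_{(i)}))^2 \mid Z, \mathcal{E}_i \big] \\
&\preccurlyeq (1 - 1/n)^{2i} \cdot \mathbb{E}[A_J^2].
\end{aligned}$$

The first semidefinite inequality follows from the convexity (1.3) of the matrix square, and
the second depends on our bounded differences assumption (2.9). Apply Lemma 2.6 with $s_i = (1 - 1/n)^i$
and $\Gamma(Z) = \mathbb{E}[A_J^2]$ to conclude that

$$V^K \preccurlyeq \frac{1}{2}\, \mathbb{E}[A_J^2] \Big( \sum_{i=0}^\infty (1 - 1/n)^i \Big)^2 = \frac{n^2}{2}\, \mathbb{E}[A_J^2] = \frac{n}{2} \sum_{j=1}^n A_j^2 \tag{2.11}$$

and that

$$V_X \preccurlyeq \frac{1}{2}\, \mathbb{E}[A_J^2] = \frac{1}{2n} \sum_{j=1}^n A_j^2. \tag{2.12}$$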

We discover that the conditional variance and the kernel conditional variance are under control 
when H has bounded coordinate differences. Section 4 discusses how these estimates imply 
that the matrix X concentrates well. 

3. Concentration Inequalities for Random Matrices 

This section contains our main results on concentration for random matrices. Given a kernel 
Stein pair, we explain how the conditional variance and kernel conditional variance allow us to 
obtain exponential tail bounds and polynomial moment inequalities. 

At a high level, our work suggests the following plan of action. You begin with a random 
matrix, X. You use the symmetries of the random matrix to construct an exchangeable counterpart,
X', that is close but not identical to X. You construct a kernel coupling from this
exchangeable pair, and you compute the conditional variances, $V_X$ and $V^K$. Then you apply
the concentration results from this section to control the deviation of X from its mean. In the 
sections to come, we provide specific examples and applications of this template. 

3.1. Exponential Tail Bounds 

Our first result establishes exponential concentration for the maximum and minimum eigenvalues
of a random matrix.

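Theorem 3.1 (Concentration for Bounded Random Matrices). Consider a $K$-Stein pair
$(X, X') \in \mathbb{H}^d \times \mathbb{H}^d$. Suppose there exist nonnegative constants $c, v, s$ for which the conditional
variance (2.6) and the kernel conditional variance (2.7) of the pair satisfy

$$V_X \preccurlyeq s^{-1} \cdot (cX + vI) \quad\text{and}\quad V^K \preccurlyeq s \cdot (cX + vI) \quad\text{almost surely.} \tag{3.1}$$

Then, for all $t \ge 0$,

$$\begin{aligned}
\mathbb{P}\{\lambda_{\min}(X) \le -t\} &\le d \cdot \exp\Big\{ \frac{-t^2}{2v} \Big\}, \\
\mathbb{P}\{\lambda_{\max}(X) \ge t\} &\le d \cdot \exp\Big\{ -\frac{t}{c} + \frac{v}{c^2} \log\Big( 1 + \frac{ct}{v} \Big) \Big\} \\
&\le d \cdot \exp\Big\{ \frac{-t^2}{2v + 2ct} \Big\}.
\end{aligned}$$

Furthermore,

$$\begin{aligned}
\mathbb{E}\, \lambda_{\min}(X) &\ge -\sqrt{2v \log d}, \\
\mathbb{E}\, \lambda_{\max}(X) &\le \sqrt{2v \log d} + c \log d.
\end{aligned}$$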
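
To get a feel for the two upper-tail bounds in Theorem 3.1, the following sketch evaluates both expressions for given constants. The function name and the sample parameter values are hypothetical; the case $c = 0$ is handled by taking the limit of both expressions, which is $d \cdot \exp(-t^2/(2v))$.

```python
import math


def lambda_max_tail_bounds(d, c, v, t):
    """Return (bennett_type, bernstein_type) upper bounds on P{lambda_max(X) >= t}.

    d: matrix dimension; c, v: the constants in hypothesis (3.1); t >= 0.
    When c == 0 both expressions are read as their common limit
    d * exp(-t^2 / (2v)).
    """
    if c == 0:
        gaussian = d * math.exp(-t * t / (2.0 * v))
        return gaussian, gaussian
    bennett = d * math.exp(-t / c + (v / (c * c)) * math.log1p(c * t / v))
    bernstein = d * math.exp(-t * t / (2.0 * v + 2.0 * c * t))
    return bennett, bernstein


# Hypothetical parameter values, for illustration only.
bennett, bernstein = lambda_max_tail_bounds(d=10, c=0.5, v=2.0, t=6.0)
```

The first expression is never larger than the second, which reflects the elementary inequality $\log(1 + u) \le u - u^2/(2 + 2u)$ for $u \ge 0$.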

Theorem 3.1 extends the concentration result of [13, Thm. 4.1], which only applies to matrix 
Stein pairs. The argument leading up to Theorem 3.1 is very similar to the proof of the earlier
result. The main innovation is a new type of mean value inequality for matrices that improves 
on [13, Lem. 3.4]. 

Lemma 3.2 (Exponential Mean Value Trace Inequality). For all matrices $A, B, C \in \mathbb{H}^d$ and
all $s > 0$ it holds that

$$\big| \operatorname{tr}[C(e^A - e^B)] \big| \le \tfrac{1}{4} \operatorname{tr}\big[ (s\,(A - B)^2 + s^{-1} C^2)(e^A + e^B) \big].$$

See Appendix B for the proofs of Theorem 3.1 and Lemma 3.2.
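
In dimension $d = 1$ the matrices in Lemma 3.2 are real numbers and the trace disappears, so the inequality can be spot-checked numerically. The sketch below does this on a small grid. It assumes that the constant on the right-hand side is $1/4$, and the function name and grid values are hypothetical.

```python
import math


def scalar_mean_value_gap(a, b, c, s):
    """Right-hand side minus left-hand side of the d = 1 case of Lemma 3.2,
    taking the constant on the right-hand side to be 1/4 (an assumption)."""
    lhs = abs(c * (math.exp(a) - math.exp(b)))
    rhs = 0.25 * (s * (a - b) ** 2 + c * c / s) * (math.exp(a) + math.exp(b))
    return rhs - lhs


# Hypothetical grid of test points for a, b, c and of scales s.
grid = [-2.0, -0.5, 0.0, 0.7, 1.5]
scales = [0.1, 1.0, 10.0]
smallest_gap = min(
    scalar_mean_value_gap(a, b, c, s)
    for a in grid for b in grid for c in grid for s in scales
)
```

A nonnegative `smallest_gap` is consistent with the scalar inequality; it is a sanity check on a finite grid, not a proof.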

3.2. Polynomial Moment Inequalities 

The second main result shows that we can bound the polynomial moments of a random matrix 
in terms of the conditional variance and the kernel conditional variance. 

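Theorem 3.3 (Matrix BDG Inequality). Suppose that $(X, X')$ is a $K$-Stein pair based on an
auxiliary exchangeable pair $(Z, Z')$. Let $p \ge 1$ be a natural number, and assume that $\mathbb{E}\|X\|_{2p}^{2p} < \infty$
and $\mathbb{E}\|K(Z, Z')\|_{2p}^{2p} < \infty$. Then, for any $s > 0$,

$$\big( \mathbb{E}\|X\|_{2p}^{2p} \big)^{1/2p} \le \sqrt{2p - 1} \cdot \Big( \mathbb{E}\big\| \tfrac{1}{2} (s\, V_X + s^{-1} V^K) \big\|_p^p \Big)^{1/2p}.$$

We have written $\|\cdot\|_p$ for the Schatten $p$-norm.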

Theorem 3.3 generalizes the matrix Burkholder-Davis-Gundy inequality [13, Thm. 7.1], which 
only applies to matrix Stein pairs. This result depends on another novel mean value inequality 
for matrices. 

Lemma 3.4 (Polynomial Mean Value Trace Inequality). For all matrices $A, B, C \in \mathbb{H}^d$, all
integers $q \ge 1$, and all $s > 0$, it holds that

$$\big| \operatorname{tr}[C(A^q - B^q)] \big| \le \tfrac{q}{4} \operatorname{tr}\big[ (s\,(A - B)^2 + s^{-1} C^2)(|A|^{q-1} + |B|^{q-1}) \big].$$

The proofs of Theorem 3.3 and Lemma 3.4 can be found in Appendix C. We remark that both 
results extend directly to infinite-dimensional Schatten-class operators. 

4. Example: Matrix Bounded Differences Inequality 

As a first example, we show how to use Theorem 3.1 to derive a matrix version of McDiarmid's 
bounded differences inequality [14]. 

Corollary 4.1 (Matrix Bounded Differences). Suppose that $Z := (Z_1, \dots, Z_n) \in \mathcal{Z}$ is a vector
of independent random variables that takes values in a Polish space $\mathcal{Z}$. Let $H : \mathcal{Z} \to \mathbb{H}^d$ be a
measurable function, and let $(A_1, \dots, A_n)$ be a deterministic sequence of Hermitian matrices
that satisfy

$$\big( H(z_1, \dots, z_n) - H(z_1, \dots, z'_j, \dots, z_n) \big)^2 \preccurlyeq A_j^2$$

where $z_k, z'_k$ range over the possible values of $Z_k$ for each $k$. Compute the boundedness parameter

$$\sigma^2 := \Big\| \sum_{j=1}^n A_j^2 \Big\|.$$

Then, for all $t \ge 0$,

$$\mathbb{P}\{ \lambda_{\max}(H(Z) - \mathbb{E}\, H(Z)) \ge t \} \le d \cdot e^{-t^2/\sigma^2}.$$

Furthermore,

$$\mathbb{E}\, \lambda_{\max}(H(Z) - \mathbb{E}\, H(Z)) \le \sigma \sqrt{\log d}.$$

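Proof. Introduce the random matrix $X = H(Z) - \mathbb{E}\, H(Z)$. We can use the kernel Stein pair
constructed in Section 2.4 to study the behavior of $X$. According to (2.12), the conditional
variance satisfies

$$V_X \preccurlyeq \frac{1}{2n} \sum_{j=1}^n A_j^2 \preccurlyeq \frac{\sigma^2}{2n}\, I.$$

According to (2.11), the kernel conditional variance satisfies

$$V^K \preccurlyeq \frac{n}{2} \sum_{j=1}^n A_j^2 \preccurlyeq \frac{n \sigma^2}{2}\, I.$$

Invoke Theorem 3.1 with $c = 0$, $v = \sigma^2/2$, and $s = n$ to complete the bound. □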

Corollary 4.1 improves on the matrix bounded differences inequality [23, Cor. 7.5], which 
features an additional factor of 1/8 in the exponent of the tail bound. It also strengthens the 
bounded differences inequality [13, Cor. 11.1] for matrix Stein pairs, which requires an extra 
assumption that the function H is "self-reproducing." 
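
The effect of the improved exponent is easy to see numerically. The sketch below evaluates the tail bound of Corollary 4.1, read as $d \cdot e^{-t^2/\sigma^2}$, next to the same expression with the extra factor $1/8$ in the exponent. The function names and the sample values of $d$, $\sigma^2$, and $t$ are hypothetical.

```python
import math


def matrix_bounded_differences_tail(d, sigma_sq, t, exponent_factor=1.0):
    """Evaluate d * exp(-exponent_factor * t^2 / sigma_sq), capped at 1.

    exponent_factor=1.0 gives the tail bound of Corollary 4.1;
    exponent_factor=1/8 gives the weaker exponent attributed to [23, Cor. 7.5].
    """
    return min(1.0, d * math.exp(-exponent_factor * t * t / sigma_sq))


def expected_lambda_max_bound(d, sigma_sq):
    """Evaluate sigma * sqrt(log d), the expectation bound of Corollary 4.1."""
    return math.sqrt(sigma_sq * math.log(d))


# Hypothetical values: dimension 50, sigma^2 = 1, deviation level t = 3.
new_bound = matrix_bounded_differences_tail(50, 1.0, 3.0)
old_bound = matrix_bounded_differences_tail(50, 1.0, 3.0, exponent_factor=1.0 / 8.0)
```

For these sample values the bound with the factor $1/8$ is still trivial (it is capped at 1), while the bound without it is already small.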

Remark 4.2 (Extensions). The conclusions of Corollary 4.1 hold with $\sigma^2 := \|A^2\|$ under either
one of the weaker hypotheses

$$\sum_{j=1}^n \big( H(z_1, \dots, z_n) - H(z_1, \dots, z'_j, \dots, z_n) \big)^2 \preccurlyeq A^2$$

or

$$\sum_{j=1}^n \mathbb{E}\big[ \big( H(z_1, \dots, z_n) - H(z_1, \dots, Z_j, \dots, z_n) \big)^2 \big] \preccurlyeq A^2$$

where $A \in \mathbb{H}^d$ is deterministic and $z_k, z'_k$ range over all possible values of $Z_k$ for each index $k$.
This claim follows from a simple adaptation of the argument in Section 2.4.

We can also obtain moment inequalities for the random matrix $H(Z)$ by invoking Theorem 3.3.
We have omitted a detailed statement because exponential tail bounds are more
popular in applications. 

5. Example: Matrix Bounded Differences without Independence 

A key strength of the method of exchangeable pairs is the fact that it also applies to random 
matrices that are built from weakly dependent random variables. This section describes an 
extension of Corollary 4.1 that holds even when the input variables exhibit some interactions. 

To quantify the amount of dependency among the variables, we use a Dobrushin interdependence
matrix [7]. This concept involves a certain amount of auxiliary notation. Given a
vector $x = (x_1, \dots, x_n)$, we write $x_{-i} = (x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n)$ for the vector with its $i$th
component deleted. Let $Z = (Z_1, \dots, Z_n)$ be a vector of random variables taking values in a
Polish space $\mathcal{Z}$ with sigma algebra $\mathcal{F}$. The symbol $\mu_i(\cdot \mid Z_{-i})$ refers to the distribution of $Z_i$
conditional on the random vector $Z_{-i}$. We also require the total variation distance $d_{\mathrm{TV}}$ between
probability measures $\mu$ and $\nu$ on $(\mathcal{Z}, \mathcal{F})$:

$$d_{\mathrm{TV}}(\nu, \mu) := \sup_{A \in \mathcal{F}} |\nu(A) - \mu(A)|. \tag{5.1}$$

With this foundation in place, we can state the definition.

Definition 5.1 (Dobrushin Interdependence Matrix). Let $Z = (Z_1, \dots, Z_n)$ be a random vector
taking values in a Polish space $\mathcal{Z}$. Let $D \in \mathbb{R}^{n \times n}$ be a matrix with a zero diagonal that satisfies
the condition

$$d_{\mathrm{TV}}\big( \mu_i(\cdot \mid x_{-i}), \mu_i(\cdot \mid y_{-i}) \big) \le \sum_{j=1}^n D_{ij}\, \mathbb{1}[x_j \ne y_j] \tag{5.2}$$

for each index $i$ and for all vectors $x, y \in \mathcal{Z}$. Then $D$ is called a Dobrushin interdependence
matrix for the random vector $Z$.

The kernel coupling method extends readily to the setting of weak dependence. We obtain 
a new matrix bounded differences inequality, which is a significant extension of Corollary 4.1. 
This statement can be viewed as a matrix version of Chatterjee's result [4, Thm. 4.3]. 

Corollary 5.2 (Dobrushin Matrix Bounded Differences). Suppose that $Z := (Z_1, \dots, Z_n)$ in
a Polish space $\mathcal{Z}$ is a vector of dependent random variables with a Dobrushin interdependence
matrix $D$ with the property that

$$\max\big\{ \|D\|_{1 \to 1}, \|D\|_{\infty \to \infty} \big\} < 1. \tag{5.3}$$

Let $H : \mathcal{Z} \to \mathbb{H}^d$ be a measurable function, and let $(A_1, \dots, A_n)$ be a deterministic sequence of
Hermitian matrices that satisfy

$$\big( H(z_1, \dots, z_n) - H(z_1, \dots, z'_j, \dots, z_n) \big)^2 \preccurlyeq A_j^2$$

where $z_k, z'_k$ range over the possible values of $Z_k$ for each $k$. Compute the boundedness and
dependence parameters

$$\sigma^2 := \Big\| \sum_{j=1}^n A_j^2 \Big\| \quad\text{and}\quad b := \Big[ 1 - \tfrac{1}{2} \big( \|D\|_{1 \to 1} + \|D\|_{\infty \to \infty} \big) \Big]^{-1}.$$

Then, for all $t \ge 0$,

$$\mathbb{P}\{ \lambda_{\max}(H(Z) - \mathbb{E}\, H(Z)) \ge t \} \le d \cdot e^{-t^2/(b \sigma^2)}.$$

Furthermore,

$$\mathbb{E}\, \lambda_{\max}(H(Z) - \mathbb{E}\, H(Z)) \le \sigma \sqrt{b \log d}.$$

The proof of Corollary 5.2 appears below in Section 5.1. In Section 6, we describe an application
of the result to physical spin systems. Observe that the bounds here are a
factor of $b$ worse than the independent case outlined in Corollary 4.1.
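
The dependence parameter is simple to compute once a Dobrushin interdependence matrix is available. The sketch below reads the parameter as $b = [1 - \tfrac{1}{2}(\|D\|_{1 \to 1} + \|D\|_{\infty \to \infty})]^{-1}$ and checks condition (5.3) first. The function name and the example matrix are hypothetical.

```python
def dobrushin_parameter(D):
    """Return the dependence parameter b of Corollary 5.2 for a square matrix D
    (given as a list of rows), or raise ValueError if condition (5.3) fails."""
    n = len(D)
    norm_1 = max(sum(abs(D[i][j]) for i in range(n)) for j in range(n))    # max column sum
    norm_inf = max(sum(abs(D[i][j]) for j in range(n)) for i in range(n))  # max row sum
    if max(norm_1, norm_inf) >= 1.0:
        raise ValueError("condition (5.3) fails: an induced norm of D is >= 1")
    return 1.0 / (1.0 - 0.5 * (norm_1 + norm_inf))


# Hypothetical interdependence matrix with zero diagonal, for illustration only.
D = [[0.0, 0.1, 0.2],
     [0.3, 0.0, 0.1],
     [0.0, 0.2, 0.0]]
b = dobrushin_parameter(D)
```

When $D$ is the zero matrix, which corresponds to independent coordinates, the function returns $b = 1$ and the bounds reduce to those of Corollary 4.1.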

5.1. Proof of Concentration under Dobrushin Assumptions 

The proof of Corollary 5.2 is longer than the argument behind Corollary 4.1, but it follows the 
same pattern. 

Exchangeable Counterparts. Let $X = H(Z) - \mathbb{E}\, H(Z)$. To begin, we form exchangeable
counterparts for the random input $Z$ and the random matrix $X$.

$$Z' := (Z_1, \dots, Z_{J-1}, \widetilde{Z}_J, Z_{J+1}, \dots, Z_n) \quad\text{and}\quad X' := H(Z') - \mathbb{E}\, H(Z)$$

where $J$ is an independent index drawn uniformly from $\{1, \dots, n\}$ and $\widetilde{Z}_i$ and $Z_i$ are conditionally
i.i.d. given $Z_{-i}$ for each index $i$.

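A Kernel Coupling. Next, we construct a kernel coupling $(Z_{(i)}, Z'_{(i)})_{i \ge 0}$ by adapting the
proof of [4, Thm. 4.3]. For each $i \ge 1$, we generate $(Z_{(i)}, Z'_{(i)})$ from $(Z_{(i-1)}, Z'_{(i-1)})$ by selecting an
independent random index $J_i$ uniformly from $\{1, \dots, n\}$ and replacing the $J_i$-th coordinates of
$Z_{(i-1)}$ and $Z'_{(i-1)}$ with $\widetilde{Z}_{(i-1),J_i}$ and $\widetilde{Z}'_{(i-1),J_i}$, respectively. The replacement variables are sampled
so that

$$\widetilde{Z}_{(i-1),j} \perp\!\!\!\perp Z_{(i-1),j} \mid Z_{(i-1),-j} \quad\text{and}\quad \widetilde{Z}'_{(i-1),j} \perp\!\!\!\perp Z'_{(i-1),j} \mid Z'_{(i-1),-j}.$$

We require that $\widetilde{Z}_{(i-1),j}$ and $\widetilde{Z}'_{(i-1),j}$ are maximally coupled, i.e.,

$$\mathbb{P}\big\{ \widetilde{Z}_{(i-1),j} \ne \widetilde{Z}'_{(i-1),j} \,\big|\, Z_{(i-1)}, Z'_{(i-1)} \big\} = d_{\mathrm{TV}}\big( \mu_j(\cdot \mid Z_{(i-1),-j}),\, \mu_j(\cdot \mid Z'_{(i-1),-j}) \big).$$

By construction, the two marginal chains $(Z_{(i)})_{i \ge 0}$ and $(Z'_{(i)})_{i \ge 0}$ have the same transition kernel as
$(Z, Z')$, and they satisfy the kernel coupling property (2.3). Furthermore, the coupling boundedness
criterion (2.4) is met, just as in the scalar setting [4, p. 78]. Lemma 2.4 now implies that
$(X, X')$ is a kernel Stein pair with kernel

$$K(z, z') := \sum_{i=0}^\infty \mathbb{E}[H(Z_{(i)}) - H(Z'_{(i)}) \mid Z_{(0)} = z, Z'_{(0)} = z'].$$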

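The Conditional Variances. With the kernel coupling established, we may proceed to
analyze the conditional variances $V_X$ and $V^K$. First, we collect the information necessary to
apply Lemma 2.6. Fix an index $i \ge 0$, and write $H(Z_{(i)}) - H(Z'_{(i)})$ as a telescoping sum:

$$\begin{aligned}
H(Z_{(i)}) - H(Z'_{(i)}) = \sum_{j=1}^n \Big[ & H\big( Z_{(i),1}, \dots, Z_{(i),j}, Z'_{(i),j+1}, \dots, Z'_{(i),n} \big) \\
& - H\big( Z_{(i),1}, \dots, Z_{(i),j-1}, Z'_{(i),j}, \dots, Z'_{(i),n} \big) \Big] =: \sum_{j=1}^n W_{(i),j}.
\end{aligned}$$

Introduce the event $\mathcal{E}_{(i),j} := \{ Z_{(i),j} \ne Z'_{(i),j} \}$. Abbreviate $p_{(i),j} = \mathbb{P}\{ \mathcal{E}_{(i),j} \mid Z, Z' \}$ and
$\overline{W}_{(i),j} = \mathbb{E}[W_{(i),j} \mid Z, Z', \mathcal{E}_{(i),j}]$. Off of the event $\mathcal{E}_{(i),j}$, it holds that $W_{(i),j} = 0$. Therefore,

$$\mathbb{E}[W_{(i),j} \mid Z, Z'] = p_{(i),j}\, \overline{W}_{(i),j}.$$

In [4, pp. 77-78], Chatterjee established that, for each $i$ and $j$,

$$p_{(i),j} \le e_j^* B^i e_J \quad\text{for}\quad B := \Big( 1 - \frac{1}{n} \Big) I + \frac{1}{n} D. \tag{5.4}$$

We use $e_k$ to denote the $k$th standard basis vector, and $B^i$ refers to the $i$th power of the square,
nonnegative matrix $B$.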
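
Combining these observations, we find that

$$\begin{aligned}
\big( \mathbb{E}[H(Z_{(i)}) - H(Z'_{(i)}) \mid Z, Z'] \big)^2
&= \Big( \sum_{j=1}^n p_{(i),j}\, \overline{W}_{(i),j} \Big)^2 \\
&\preccurlyeq \frac{1}{2} \sum_{j,k=1}^n \big( \overline{W}_{(i),j}^2 + \overline{W}_{(i),k}^2 \big)\, p_{(i),j}\, p_{(i),k} \\
&\preccurlyeq \sum_{j,k=1}^n A_k^2 \cdot (e_j^* B^i e_J)(e_k^* B^i e_J) \\
&= \|B^i e_J\|_1 \cdot \sum_{k=1}^n A_k^2 \cdot e_k^* B^i e_J \\
&\preccurlyeq \|B\|_{1 \to 1}^i \cdot \sum_{k=1}^n A_k^2 \cdot e_k^* B^i e_J.
\end{aligned}$$

The first semidefinite inequality follows from (1.2). The second relation depends on (5.4) and
on the bounded differences hypothesis. We reach the next identity by summing over $j$, noting that
$e_j^* B^i e_J$ is nonnegative. The last inequality follows from the definition (1.4) of $\|\cdot\|_{1 \to 1}$ and
the fact that this norm is submultiplicative.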

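Next, take the expectation of the latter display with respect to $J$. We obtain

$$\begin{aligned}
\mathbb{E}\Big[ \Big( \sum_{j=1}^n \mathbb{E}[W_{(i),j} \mid Z, Z'] \Big)^2 \,\Big|\, Z \Big]
&\preccurlyeq \|B\|_{1 \to 1}^i \cdot \frac{1}{n} \sum_{1 \le j,k \le n} A_k^2 \cdot e_k^* B^i e_j \\
&= \|B\|_{1 \to 1}^i \cdot \frac{1}{n} \sum_{k=1}^n A_k^2 \cdot \|e_k^* B^i\|_1 \\
&\preccurlyeq \|B\|_{1 \to 1}^i\, \|B\|_{\infty \to \infty}^i \cdot \frac{1}{n} \sum_{k=1}^n A_k^2.
\end{aligned}$$

The justifications are similar to those for the preceding calculation.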

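As a consequence of this bound, we are in a position to apply Lemma 2.6. Set $\Gamma(Z) = n^{-1} \sum_{k=1}^n A_k^2$
and $s_i = \|B\|_{1 \to 1}^{i/2}\, \|B\|_{\infty \to \infty}^{i/2}$ for each $i \ge 0$. The lemma delivers

$$V_X \preccurlyeq \frac{1}{2n} \sum_{k=1}^n A_k^2 \preccurlyeq \frac{\sigma^2}{2n}\, I$$

and

$$V^K \preccurlyeq \frac{1}{2} \Big( \sum_{i=0}^\infty \|B\|_{1 \to 1}^{i/2}\, \|B\|_{\infty \to \infty}^{i/2} \Big)^2 \cdot \frac{1}{n} \sum_{k=1}^n A_k^2
\preccurlyeq \Big( \sum_{i=0}^\infty \|B\|_{1 \to 1}^{i/2}\, \|B\|_{\infty \to \infty}^{i/2} \Big)^2 \frac{\sigma^2}{2n}\, I,$$

where $\sigma^2$ is defined in the statement of Corollary 5.2. It remains to simplify the formula for the
kernel conditional variance.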

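The definition of $B$ ensures that

$$\|B\|_{1 \to 1} \le 1 - \frac{1}{n} \big( 1 - \|D\|_{1 \to 1} \big) \quad\text{and}\quad \|B\|_{\infty \to \infty} \le 1 - \frac{1}{n} \big( 1 - \|D\|_{\infty \to \infty} \big).$$

As a consequence of the geometric-arithmetic mean inequality,

$$\|B\|_{1 \to 1}^{1/2}\, \|B\|_{\infty \to \infty}^{1/2} \le 1 - \frac{1}{n} \Big[ 1 - \tfrac{1}{2} \big( \|D\|_{1 \to 1} + \|D\|_{\infty \to \infty} \big) \Big] = 1 - \frac{1}{nb}.$$

We conclude that

$$V^K \preccurlyeq \Big( \sum_{i=0}^\infty \Big( 1 - \frac{1}{nb} \Big)^i \Big)^2 \frac{\sigma^2}{2n}\, I = \frac{n b^2 \sigma^2}{2}\, I,$$

where $b$ is defined in the statement of Corollary 5.2.

Finally, we invoke Theorem 3.1 with $c = 0$, $v = b \sigma^2/2$, and $s = nb$ to obtain the advertised
conclusions.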

6. Application: Correlation in the 2D Ising Model

In this section, we apply the dependent matrix bounded differences inequality of Corollary 5.2
to study correlations in a simple spin system. Consider the 2D Ising model without an external
field on an $n \times n$ square lattice with a periodic boundary. Let $\sigma := (\sigma_{ij} : 1 \le i, j \le n)$ be
an array of random spins taking values in $\{+1, -1\}$. To simplify the discussion, we treat array
indices periodically, so we interpret the index $i$ to mean $((i - 1) \bmod n) + 1$. We also write
$(i, j) \sim (k, l)$ to indicate that the vertices are neighbors in the periodic square lattice; that is,
$k = i \pm 1$ and $l = j$ or else $k = i$ and $l = j \pm 1$. With this notation, the Hamiltonian may be
expressed as

$$H(\sigma) = \sum_{(i,j) \sim (k,l)} \sigma_{ij}\, \sigma_{kl},$$

where the sum occurs over distinct pairs of neighboring vertices. We assign a probability distribution
to the array $\sigma$ of spins:

$$\mathbb{P}\{\sigma\} = \frac{1}{\Lambda} \exp(\beta H(\sigma)), \tag{6.1}$$

where $\Lambda = \sum_{\sigma'} \exp(\beta H(\sigma'))$ denotes the normalizing constant (also known as the partition
function). This model has been studied extensively, and it is known to exhibit a phase transition
at $\beta_c = \tfrac{1}{2} \log(1 + \sqrt{2})$. For example, see [19].

Fix a positive number d < n. For indices 1 < i,j < d, we define the spin-spin correlation 
function as 

Cij = E[Oll<Ty]. 

We write C for the d x d matrix whose entries are Cy . The paper [25] of Wu offers an explicit 
expression for the correlations in the limit as the size n of the lattice tends to infinity. In 
particular, in the high-temperature regime /3 < /3 C , the correlations decay exponentially. On the 
other hand, this is a limiting result and there is no analytic formula for finite lattices. 

One may wish to estimate the spin-spin correlation matrix from a sampled value $\sigma$ of the
spins. We propose the estimator

\[ \hat{c}_{ij} := \frac{1}{n^2} \sum_{1 \le k, l \le n} \sigma_{kl} \, \sigma_{k+i-1,\, l+j-1} \quad \text{for } 1 \le i, j \le d. \tag{6.2} \]

The mean of the matrix $\hat{C}$ with entries $\hat{c}_{ij}$ is the spin-spin correlation matrix $C$, so it is natural
to wonder about the deviations of the estimator from its mean value. We can use the concentration
results from the previous section to quantify these fluctuations.
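To make the definition (6.2) concrete, the following Python sketch evaluates the estimator for a given array of spins. It is an illustration only: the function name `correlation_estimator` and the i.i.d. test input are our own assumptions, and the code uses zero-based indices, so the shift $(i-1, j-1)$ in (6.2) becomes $(i, j)$.

```python
import random

def correlation_estimator(spins, d):
    """Return the d x d estimator from (6.2) for an n x n array of +/-1 spins."""
    n = len(spins)
    c_hat = [[0.0] * d for _ in range(d)]
    for i in range(d):          # zero-based i plays the role of index i + 1 in (6.2)
        for j in range(d):
            total = 0
            for k in range(n):
                for l in range(n):
                    # Periodic indexing: shift (k, l) by (i, j) modulo n.
                    total += spins[k][l] * spins[(k + i) % n][(l + j) % n]
            c_hat[i][j] = total / n**2
    return c_hat

# Hypothetical input: i.i.d. uniform spins (the case beta = 0), chosen only
# because it requires no Ising sampler.
random.seed(0)
n, d = 8, 3
spins = [[random.choice((-1, 1)) for _ in range(n)] for _ in range(n)]
c_hat = correlation_estimator(spins, d)
```

The top-left entry of the output always equals one because each spin squares to one; the remaining entries lie in $[-1, 1]$.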



6.1. Concentration for General Matrices 



Since the estimator $\hat{C}$ need not be Hermitian, we need a way to extend our techniques to
general matrices. We employ a well-known device from operator theory, called the Hermitian
dilation [23, Sec. 2.6].

Definition 6.1 (Hermitian dilation). Consider a matrix $B \in \mathbb{C}^{d_1 \times d_2}$, and set $d = d_1 + d_2$. The
Hermitian dilation of $B$ is the matrix

\[ \mathscr{D}(B) := \begin{bmatrix} 0 & B \\ B^* & 0 \end{bmatrix} \in \mathbb{H}^d. \]
The dilation preserves spectral properties in the sense that $\lambda_{\max}(\mathscr{D}(B)) = \|\mathscr{D}(B)\| = \|B\|$.
Therefore,

\[ \mathbb{P}\bigl\{ \|\hat{C} - C\| \ge t \bigr\} = \mathbb{P}\bigl\{ \lambda_{\max}\bigl(\mathscr{D}(\hat{C}) - \mathscr{D}(C)\bigr) \ge t \bigr\}. \tag{6.3} \]

Using this observation, we can obtain a tail bound for the spectral-norm error in the estimator
$\hat{C}$ by studying its dilation.
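The spectral identity for the dilation is easy to confirm numerically. The sketch below assumes NumPy is available; the helper name `hermitian_dilation` and the random test matrix are hypothetical choices made for illustration.

```python
import numpy as np

def hermitian_dilation(B):
    """Return the Hermitian dilation [[0, B], [B*, 0]] of a d1 x d2 matrix B."""
    d1, d2 = B.shape
    top = np.hstack([np.zeros((d1, d1), dtype=complex), B])
    bottom = np.hstack([B.conj().T, np.zeros((d2, d2), dtype=complex)])
    return np.vstack([top, bottom])

rng = np.random.default_rng(0)                  # arbitrary seed for illustration
B = rng.standard_normal((3, 2)) + 1j * rng.standard_normal((3, 2))
D = hermitian_dilation(B)

lam_max = np.linalg.eigvalsh(D)[-1]             # eigenvalues are returned in ascending order
spec_norm_B = np.linalg.norm(B, 2)              # largest singular value of B
spec_norm_D = np.linalg.norm(D, 2)
```

Up to rounding error, the three quantities `lam_max`, `spec_norm_B`, and `spec_norm_D` coincide, as the identity predicts.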

6.2. Bounding the Dobrushin Coefficients 

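To apply Corollary 5.2, we need to bound the Dobrushin coefficients of the array $\sigma$ of spins.
Let $\sigma' \in \{\pm 1\}^{n \times n}$ be a second independent draw from the Ising model. Extending our notation
from before, we write $\mu_{ij}(\cdot \mid \sigma_{-(i,j)})$ for the conditional distribution of $\sigma_{ij}$ given the remaining
variables. In our setting,

\[
d_{\mathrm{TV}}\bigl( \mu_{ij}(\cdot \mid \sigma_{-(i,j)}),\ \mu_{ij}(\cdot \mid \sigma'_{-(i,j)}) \bigr)
= \Bigl| \mathbb{P}\Bigl\{ \sigma_{ij} = 1 \Bigm| \sum_{(k,l) : (i,j) \sim (k,l)} \sigma_{kl} \Bigr\}
- \mathbb{P}\Bigl\{ \sigma'_{ij} = 1 \Bigm| \sum_{(k,l) : (i,j) \sim (k,l)} \sigma'_{kl} \Bigr\} \Bigr|.
\tag{6.4}
\]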



It follows from (6.1) that

\[
\mathbb{P}\Bigl\{ \sigma_{ij} = 1 \Bigm| \sum_{(k,l) : (i,j) \sim (k,l)} \sigma_{kl} = s \Bigr\}
= \frac{\exp(s\beta)}{\exp(s\beta) + \exp(-s\beta)} = \frac{1}{1 + \exp(-2s\beta)}
\]

for each possible value $s \in \{-4, -2, 0, 2, 4\}$. Therefore, the expression (6.4) admits the upper
bound

\[ \frac{1}{1 + \exp(-4\beta)} - \frac{1}{2} \]

when $\sigma$ and $\sigma'$ differ in a single coordinate. We may select the Dobrushin interdependence
matrix

\[
D_{(i,j),(k,l)} =
\begin{cases}
(1 + \exp(-4\beta))^{-1} - \tfrac{1}{2}, & \text{when } (i,j) \sim (k,l), \\
0, & \text{otherwise.}
\end{cases}
\]

This matrix satisfies the Dobrushin condition (5.2). By direct computation,

\[ \max\{ \|D\|_1, \|D\|_\infty \} = \frac{4}{1 + \exp(-4\beta)} - 2 \tag{6.5} \]

because every vertex has four neighbors. The right-hand side of (6.5) is smaller than one
precisely when $\beta < \beta_D := \tfrac{1}{4} \log(3)$. Since $\beta_D < \beta_c$, the hypotheses of Corollary 5.2 are satisfied
for only part of the high-temperature regime.
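The threshold computation can be reproduced in a few lines. The following sketch is illustrative only; the function names are hypothetical, and it assumes the logistic form $1/(1 + \exp(-2s\beta))$ for the conditional probability and the value $\beta_D = \tfrac{1}{4}\log 3$ obtained by setting the row sum of $D$ equal to one.

```python
import math

def prob_spin_up(s, beta):
    """P{sigma_ij = +1 | the four neighboring spins sum to s}."""
    return 1.0 / (1.0 + math.exp(-2.0 * s * beta))

def dobrushin_entry(beta):
    """Largest change in the conditional probability when one neighbor flips."""
    sums = (-4, -2, 0, 2, 4)
    # Flipping a single neighbor moves the neighbor sum by exactly 2.
    return max(abs(prob_spin_up(s + 2, beta) - prob_spin_up(s, beta)) for s in sums[:-1])

def dobrushin_norm(beta):
    """Row (and column) sum of D: four neighbors, each with the same entry."""
    return 4.0 * dobrushin_entry(beta)

beta_D = 0.25 * math.log(3.0)                    # threshold where the row sum equals one
beta_c = 0.5 * math.log(1.0 + math.sqrt(2.0))    # critical inverse temperature
```

Numerically, $\beta_D \approx 0.275$ and $\beta_c \approx 0.441$, so the Dobrushin regime covers roughly the first 60% of the high-temperature range.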

6.3. Tail Bound for the Estimator 

We intend to apply Corollary 5.2 to the Hermitian matrix $\mathscr{D}(\hat{C})$. For each index $1 \le i, j \le n$,
write $\hat{C}^{(ij)}$ for the value of $\hat{C}$ when the sign of $\sigma_{ij}$ is flipped. From (6.2), we have the inequalities

\[ \bigl| \hat{c}_{kl} - \hat{c}^{(ij)}_{kl} \bigr| \le \frac{4}{n^2} \quad \text{for } 1 \le k, l \le d. \]

As a consequence, we reach the semidefinite relation

\[ \bigl( \mathscr{D}(\hat{C}) - \mathscr{D}(\hat{C}^{(ij)}) \bigr)^2 \preccurlyeq \frac{16 d^2}{n^4} \, I. \]

Summing over all vertices in the lattice, we obtain an inequality for the boundedness parameter

\[ \sigma^2 \le \frac{16 d^2}{n^2}. \]

For $\beta < \beta_D$, Corollary 5.2 implies that

\[
\mathbb{P}\bigl\{ \|\hat{C} - C\| \ge t \bigr\}
\le 2d \cdot \exp\Bigl( - \bigl( 3 - 4 (1 + \exp(-4\beta))^{-1} \bigr) \, \frac{n^2 t^2}{16 d^2} \Bigr).
\]

Consequently, the typical deviation $\mathbb{E} \|\hat{C} - C\|$ has order $(d \sqrt{\log d})/n$. In the regime
where $\beta < \beta_D$, one sample therefore suffices to obtain an accurate estimate of the spin-spin
correlation matrix $C$, provided that $n \gg d$.
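To get a feeling for the scales involved, one can evaluate the tail bound numerically. The sketch below assumes the bound has the form $2d \cdot \exp(-t^2/(b\sigma^2))$ with $b^{-1} = 3 - 4(1 + \exp(-4\beta))^{-1}$ and $\sigma^2 = 16d^2/n^2$; the function names and the sample parameters are hypothetical.

```python
import math

def ising_tail_bound(t, n, d, beta):
    """Evaluate 2d * exp(-t^2 / (b * sigma^2)) for the estimator error."""
    beta_D = 0.25 * math.log(3.0)
    if not beta < beta_D:
        raise ValueError("the bound is only stated for beta < beta_D")
    inv_b = 3.0 - 4.0 / (1.0 + math.exp(-4.0 * beta))   # 1/b, positive when beta < beta_D
    sigma_sq = 16.0 * d**2 / n**2                        # bound on the variance parameter
    return 2.0 * d * math.exp(-inv_b * t**2 / sigma_sq)

def deviation_scale(n, d, beta):
    """Value of t at which the bound equals one: sqrt(b * sigma^2 * log(2d))."""
    inv_b = 3.0 - 4.0 / (1.0 + math.exp(-4.0 * beta))
    return math.sqrt(16.0 * d**2 / n**2 * math.log(2.0 * d) / inv_b)

t0 = deviation_scale(n=100, d=5, beta=0.1)
```

The bound is informative only for $t$ larger than `t0`, which scales like $(d\sqrt{\log d})/n$ in agreement with the typical deviation stated above.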

7. Complements 

The tools of Section 3 are applicable in a wide variety of settings. To indicate what might 
be possible, we briefly present another packaged concentration result. We also indicate some 
prospects for future research. 

7.1. Matrix-Valued Functions of Haar Random Elements

This section describes a concentration result for a matrix-valued function of a random element 
drawn uniformly from a compact group. This corollary can be viewed as a matrix extension 
of [4, Thm. 4.6]. We provide the proof in Appendix D. 

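Corollary 7.1 (Concentration for Hermitian Functions of Haar Measures). Let $Z \sim \mu$ be Haar
distributed on a compact topological group $G$, and let $\Psi : G \to \mathbb{H}^d$ be a measurable function
satisfying $\mathbb{E}\, \Psi(Z) = 0$. Let $Y, Y_1, Y_2, \dots$ be i.i.d. random variables in $G$ satisfying

\[ Y =_d Y^{-1} \quad \text{and} \quad z Y z^{-1} =_d Y \quad \text{for all } z \in G. \tag{7.1} \]

Compute the boundedness parameter

\[ \sigma^2 := \frac{S^2}{2} \sum_{i=0}^{\infty} \min\bigl\{ 1,\ 4 R S^{-1} \, d_{\mathrm{TV}}(\mu_i, \mu) \bigr\}, \]

where $\mu_i$ is the distribution of the product $Y_i \cdots Y_1$,

\[ \|\Psi(z)\| \le R \ \text{ for all } z \in G, \quad \text{and} \quad S^2 = \sup_{g \in G} \bigl\| \mathbb{E}\bigl[ (\Psi(g) - \Psi(Yg))^2 \bigr] \bigr\|. \]

Then, for all $t \ge 0$,

\[ \mathbb{P}\{ \lambda_{\max}(\Psi(Z)) \ge t \} \le d \cdot e^{-t^2/(2\sigma^2)}. \]

Furthermore,

\[ \mathbb{E}\, \lambda_{\max}(\Psi(Z)) \le \sigma \sqrt{2 \log d}. \]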
Corollary 7.1 relates the concentration of Hermitian functions to the convergence of random
walks on a group. Since representation theory leads to a matrix model of compact groups, it is
often natural to build random matrices from a group representation. In particular, Corollary 7.1
can be used to study matrices constructed from random permutations or random unitary
matrices. We omit the details.
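As a small illustration of the random-walk quantity that enters Corollary 7.1, the sketch below computes the total variation distances exactly for a lazy random-transposition walk on the symmetric group $S_3$. Everything specific here is an assumption made for the example: the choice of group, the step distribution (identity with probability 1/2, otherwise a uniform transposition, which is symmetric and conjugation invariant), and the placeholder values $R = S = 1$.

```python
from itertools import permutations

def compose(p, q):
    """Return the permutation p o q (apply q first, then p)."""
    return tuple(p[q[i]] for i in range(len(q)))

def inverse(p):
    inv = [0] * len(p)
    for i, pi in enumerate(p):
        inv[pi] = i
    return tuple(inv)

n = 3
group = list(permutations(range(n)))
identity = tuple(range(n))
transpositions = [g for g in group if sum(g[i] != i for i in range(n)) == 2]

# Hypothetical step distribution for Y: lazy random transpositions.
step = {g: 0.0 for g in group}
step[identity] = 0.5
for g in transpositions:
    step[g] = 0.5 / len(transpositions)

def tv_to_uniform(dist):
    return 0.5 * sum(abs(dist[g] - 1.0 / len(group)) for g in group)

def tv_sequence(num_steps):
    """Return [d_TV(mu_0, mu), ..., d_TV(mu_num_steps, mu)] for the walk Y_i ... Y_1."""
    dist = {g: 1.0 if g == identity else 0.0 for g in group}   # mu_0: empty product
    out = [tv_to_uniform(dist)]
    for _ in range(num_steps):
        # mu_i(g) = sum_y step(y) * mu_{i-1}(y^{-1} g)
        dist = {g: sum(step[y] * dist[compose(inverse(y), g)] for y in group)
                for g in group}
        out.append(tv_to_uniform(dist))
    return out

def boundedness_parameter(R, S, tv):
    """Truncated sigma^2 = (S^2 / 2) * sum_i min{1, 4 R d_TV(mu_i, mu) / S}."""
    return 0.5 * S**2 * sum(min(1.0, 4.0 * R * d / S) for d in tv)

tv = tv_sequence(60)
sigma_sq = boundedness_parameter(R=1.0, S=1.0, tv=tv)   # R and S are placeholders
```

For this walk the distances decay geometrically, so the series defining $\sigma^2$ converges quickly and truncating it after 60 terms changes the value negligibly.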

7.2. Conjectures and Consequences 

Lugosi et al. [3] study a class of self-bounding (scalar) functions, which arise in applications in 
statistics and learning theory. They use log-Sobolev inequalities to obtain information about 
the concentration properties of these functions. It is also possible to perform the analysis using 
the method of exchangeable pairs. 

Let us introduce the matrix analog of a self-bounding function. 

Definition 7.2 (Self-bounding Matrix Function). A function $H : \mathcal{Z} \to \mathbb{H}^d$ is called $(a, b)$
matrix self-bounding if, for any $Z, Z' \in \mathcal{Z}$,

1. $H(Z) - H(z_1, \dots, z_i', \dots, z_n) \preccurlyeq I$, and

2. $\sum_{i=1}^{n} \bigl( H(Z) - H(z_1, \dots, z_i', \dots, z_n) \bigr)_+ \preccurlyeq a H(Z) + b I$.

$H$ is weakly $(a, b)$ matrix self-bounding if, for any $Z, Z' \in \mathcal{Z}$,

\[ \sum_{i=1}^{n} \bigl( H(Z) - H(z_1, \dots, z_i', \dots, z_n) \bigr)_+^2 \preccurlyeq a H(Z) + b I. \]

Mackey [12, Thm. 25] proposed a slightly different definition that includes an additional self- 
reciprocity condition. His analysis requires this extra hypothesis because it is based on matrix 
Stein pairs. 

The approach in this paper is not quite strong enough to develop concentration inequalities 
for self-bounding matrix functions. Our techniques would work if the following mean value trace 
inequality were valid. 

Conjecture 7.3 (Signed Mean Value Trace Inequalities). For all matrices $A, B, C \in \mathbb{H}^d$, all
positive integers $q$, and any $s > 0$, it holds that

\[
\operatorname{tr}[C(e^A - e^B)] \le \frac{1}{2} \operatorname{tr}\bigl[ \bigl( s(A - B)_+^2 + s^{-1} C_+^2 \bigr) e^A + \bigl( s(A - B)_-^2 + s^{-1} C_-^2 \bigr) e^B \bigr]
\]

and

\[
\operatorname{tr}[C(A^q - B^q)] \le \frac{q}{2} \operatorname{tr}\bigl[ \bigl( s(A - B)_+^2 + s^{-1} C_+^2 \bigr) |A|^{q-1} + \bigl( s(A - B)_-^2 + s^{-1} C_-^2 \bigr) |B|^{q-1} \bigr].
\]

This statement involves the standard matrix functions that lift the scalar functions $x_+ :=
\max\{x, 0\}$ and $x_- := \max\{-x, 0\}$. Extensive simulations with random matrices suggest that
Conjecture 7.3 holds, but we did not find a proof.
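A simulation of the kind mentioned above might look like the following sketch. It is not the authors' code: it assumes NumPy, it assumes the exponential inequality carries the constant $1/2$ with $(A-B)_\pm^2$ read as the square of the positive or negative part, and the sampling scheme for $A$, $B$, $C$, and $s$ is an arbitrary choice.

```python
import numpy as np

def herm_func(A, f):
    """Apply a scalar function f to a Hermitian matrix A via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * f(w)) @ V.conj().T

def pos_sq(A):
    """Return (A_+)^2, the square of the positive part of A."""
    return herm_func(A, lambda w: np.maximum(w, 0.0) ** 2)

def neg_sq(A):
    """Return (A_-)^2, the square of the negative part of A."""
    return herm_func(A, lambda w: np.maximum(-w, 0.0) ** 2)

def exponential_gap(A, B, C, s):
    """Right-hand side minus left-hand side of the conjectured exponential inequality."""
    eA, eB = herm_func(A, np.exp), herm_func(B, np.exp)
    lhs = np.trace(C @ (eA - eB)).real
    rhs = 0.5 * np.trace((s * pos_sq(A - B) + pos_sq(C) / s) @ eA
                         + (s * neg_sq(A - B) + neg_sq(C) / s) @ eB).real
    return rhs - lhs

def random_hermitian(rng, d):
    G = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    return (G + G.conj().T) / 2

rng = np.random.default_rng(0)
gaps = [exponential_gap(random_hermitian(rng, 4), random_hermitian(rng, 4),
                        random_hermitian(rng, 4), s=rng.uniform(0.1, 10.0))
        for _ in range(200)]
smallest_gap = min(gaps)   # a negative value would indicate a counterexample
```

A nonnegative `smallest_gap` is consistent with the conjecture but, of course, proves nothing; the scalar case $d = 1$ follows from the mean value theorem and serves as a sanity check.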
Appendix A: Operator Inequalities

Our main results rely on some basic inequalities from operator theory. We are not aware of 
good references for this material, so we have included short proofs. 

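A.1. Young's Inequality for Commuting Operators

In the scalar setting, Young's inequality provides an additive bound for the product of two
numbers. More precisely, for indices $p, q \in (1, \infty)$ that satisfy the conjugacy relation $p^{-1} + q^{-1} = 1$,
we have

\[ ab \le \frac{1}{p} |a|^p + \frac{1}{q} |b|^q \quad \text{for all } a, b \in \mathbb{R}. \tag{A.1} \]

The same result has a natural extension for commuting operators.

Lemma A.1 (Young's Inequality for Commuting Operators). Suppose that $\mathcal{A}$ and $\mathcal{B}$ are
self-adjoint linear maps on the Hilbert space $\mathbb{M}^d$ that commute with each other. Let $p, q \in (1, \infty)$
satisfy the conjugacy relation $p^{-1} + q^{-1} = 1$. Then

\[ \mathcal{A}\mathcal{B} \preccurlyeq \frac{1}{p} |\mathcal{A}|^p + \frac{1}{q} |\mathcal{B}|^q. \]

Proof. Since $\mathcal{A}$ and $\mathcal{B}$ commute, there exists a unitary operator $\mathcal{U}$ and diagonal operators $\mathcal{D}$
and $\mathcal{M}$ for which $\mathcal{A} = \mathcal{U}\mathcal{D}\mathcal{U}^*$ and $\mathcal{B} = \mathcal{U}\mathcal{M}\mathcal{U}^*$. Young's inequality (A.1) for scalars immediately
implies that

\[ \mathcal{D}\mathcal{M} \preccurlyeq \frac{1}{p} |\mathcal{D}|^p + \frac{1}{q} |\mathcal{M}|^q. \]

Conjugating both sides of this inequality by $\mathcal{U}$, we obtain

\[ \mathcal{A}\mathcal{B} = \mathcal{U}(\mathcal{D}\mathcal{M})\mathcal{U}^* \preccurlyeq \frac{1}{p}\, \mathcal{U} |\mathcal{D}|^p \mathcal{U}^* + \frac{1}{q}\, \mathcal{U} |\mathcal{M}|^q \mathcal{U}^* = \frac{1}{p} |\mathcal{A}|^p + \frac{1}{q} |\mathcal{B}|^q. \]

The last identity follows from the definition of a standard function of an operator. □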
A.2. An Operator Version of Cauchy-Schwarz

We also need a simple version of the Cauchy-Schwarz inequality for operators. The proof follows
a classical argument, but it also involves an operator decomposition.

Lemma A.2 (Operator Cauchy-Schwarz). Let $\mathcal{A}$ be a self-adjoint linear operator on the Hilbert
space $\mathbb{M}^d$, and let $M$ and $N$ be matrices in $\mathbb{M}^d$. Then

\[ |\langle M, \mathcal{A}(N) \rangle| \le \bigl[ \langle M, |\mathcal{A}|(M) \rangle \cdot \langle N, |\mathcal{A}|(N) \rangle \bigr]^{1/2}. \]

The inner product symbol refers to the trace, or Frobenius, inner product.

Proof. Consider the Jordan decomposition $\mathcal{A} = \mathcal{A}_+ - \mathcal{A}_-$, where $\mathcal{A}_+$ and $\mathcal{A}_-$ are both positive
semidefinite. For all $s > 0$,

\[
\begin{aligned}
0 &\le \langle sM - s^{-1}N,\ \mathcal{A}_+(sM - s^{-1}N) \rangle \\
  &= s^2 \langle M, \mathcal{A}_+(M) \rangle + s^{-2} \langle N, \mathcal{A}_+(N) \rangle - 2 \langle M, \mathcal{A}_+(N) \rangle.
\end{aligned}
\]

Likewise,

\[
\begin{aligned}
0 &\le \langle sM + s^{-1}N,\ \mathcal{A}_-(sM + s^{-1}N) \rangle \\
  &= s^2 \langle M, \mathcal{A}_-(M) \rangle + s^{-2} \langle N, \mathcal{A}_-(N) \rangle + 2 \langle M, \mathcal{A}_-(N) \rangle.
\end{aligned}
\]

Add the latter two inequalities and rearrange the terms to obtain

\[ 2 \langle M, \mathcal{A}(N) \rangle \le s^2 \langle M, |\mathcal{A}|(M) \rangle + s^{-2} \langle N, |\mathcal{A}|(N) \rangle, \]

where we have used the relation $|\mathcal{A}| = \mathcal{A}_+ + \mathcal{A}_-$. Take the infimum of the right-hand side over
$s > 0$ to reach

\[ \langle M, \mathcal{A}(N) \rangle \le \bigl[ \langle M, |\mathcal{A}|(M) \rangle \cdot \langle N, |\mathcal{A}|(N) \rangle \bigr]^{1/2}. \tag{A.2} \]

Repeat the same argument, interchanging the roles of the matrices $sM - s^{-1}N$ and $sM + s^{-1}N$.
We conclude that (A.2) also holds with an absolute value on the left-hand side. This observation
completes the proof. □

Appendix B: Proof of the Exponential Tail Bound 

This appendix contains a proof of the exponential tail bound Theorem 3.1. The argument 
parallels the approach developed in [13], but we require more powerful estimates along the way. 
In view of the similarities, we emphasize the places where the proofs differ, and we suppress 
details that are identical with the earlier work. 



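B.1. The Matrix Laplace Transform Method

A central tool in our investigation is a matrix variant of the classical moment generating
function. Ahlswede & Winter [1, App.] introduced this definition in their investigation of matrix
concentration.

Definition B.1 (Trace Mgf). Let $X$ be a random Hermitian matrix. The (normalized) trace
moment generating function of $X$ is defined as

\[ m(\theta) := m_X(\theta) := \mathbb{E}\, \overline{\operatorname{tr}}\, e^{\theta X} \quad \text{for } \theta \in \mathbb{R}. \]

The following proposition from [13, Prop. 3.3] collects results from [1, 18, 23, 6].

Proposition B.2 (Matrix Laplace Transform Method). Let $X \in \mathbb{H}^d$ be a random matrix with
normalized trace mgf $m(\theta) := \mathbb{E}\, \overline{\operatorname{tr}}\, e^{\theta X}$. For each $t \in \mathbb{R}$,

\[ \mathbb{P}\{ \lambda_{\max}(X) \ge t \} \le d \cdot \inf_{\theta > 0} \exp\{ -\theta t + \log m(\theta) \}, \tag{B.1} \]

\[ \mathbb{P}\{ \lambda_{\min}(X) \le t \} \le d \cdot \inf_{\theta < 0} \exp\{ -\theta t + \log m(\theta) \}. \tag{B.2} \]

Furthermore,

\[ \mathbb{E}\, \lambda_{\max}(X) \le \inf_{\theta > 0} \frac{1}{\theta} \bigl[ \log d + \log m(\theta) \bigr], \tag{B.3} \]

\[ \mathbb{E}\, \lambda_{\min}(X) \ge \sup_{\theta < 0} \frac{1}{\theta} \bigl[ \log d + \log m(\theta) \bigr]. \tag{B.4} \]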

In summary, we can bound the extreme eigenvalues of a random matrix by controlling the trace 
mgf. 
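A toy case shows how the expectation bound (B.3) is used. Take $X = \varepsilon \cdot \operatorname{diag}(a)$, where $\varepsilon$ is a Rademacher sign; then the normalized trace mgf is $m(\theta) = d^{-1} \sum_i \cosh(\theta a_i)$, and $\mathbb{E}\,\lambda_{\max}(X)$ is available in closed form. The model, the function names, and the grid of $\theta$ values in the sketch below are assumptions made purely for illustration.

```python
import math

def expected_lambda_max(a):
    """Exact E lambda_max(X) for X = eps * diag(a) with a Rademacher sign eps."""
    return 0.5 * (max(a) + max(-x for x in a))

def log_trace_mgf(theta, a):
    """log m(theta), where m(theta) = (1/d) * sum_i cosh(theta * a_i)."""
    return math.log(sum(math.cosh(theta * x) for x in a) / len(a))

def laplace_bound(a, thetas):
    """Minimize (1/theta) * [log d + log m(theta)] over a grid of theta > 0."""
    d = len(a)
    return min((math.log(d) + log_trace_mgf(th, a)) / th for th in thetas)

a = [1.0, -0.5, 0.25, 2.0]                   # hypothetical diagonal entries
thetas = [0.05 * k for k in range(1, 400)]   # grid on (0, 20)
exact = expected_lambda_max(a)
bound = laplace_bound(a, thetas)
```

Every $\theta > 0$ yields a valid upper bound, so restricting the infimum to a grid can only make the bound looser, never invalid.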
B.2. The Method of Exchangeable Pairs

The main technical challenge in developing concentration inequalities is to obtain bounds for the
trace mgf. In this work, we follow the approach from the paper [13], which extends Chatterjee's
concentration argument [5] to the matrix setting. The key idea is to use an exchangeable pair
to bound the derivative of the trace mgf, which in turn allows us to control the growth of the
trace mgf.

We begin with a technical lemma, which generalizes [13, Lem. 2.3] and [4, Lem. 3.1]. This 
result permits us to rewrite certain matrix expectations using kernel Stein pairs. 

Lemma B.3 (Method of Exchangeable Pairs). Suppose that $(X, X') \in \mathbb{H}^d \times \mathbb{H}^d$ is a $K$-Stein
pair constructed from an auxiliary exchangeable pair $(Z, Z')$. Let $F : \mathbb{H}^d \to \mathbb{H}^d$ be a measurable
function that satisfies the regularity condition

\[ \mathbb{E}\, \| K(Z, Z') \cdot F(X) \| < \infty. \tag{B.5} \]

Then

\[ \mathbb{E}[X \cdot F(X)] = \frac{1}{2}\, \mathbb{E}\bigl[ K(Z, Z') (F(X) - F(X')) \bigr]. \tag{B.6} \]

Proof. Definition 2.2, of a kernel Stein pair, implies that

\[ \mathbb{E}[X \cdot F(X)] = \mathbb{E}\bigl[ \mathbb{E}[K(Z, Z') \mid Z] \cdot F(X) \bigr] = \mathbb{E}[K(Z, Z')\, F(X)], \]

where we justify the pull-through property of conditional expectation using the regularity
condition (B.5). Since the pair $(Z, Z')$ is exchangeable and the kernel $K$ satisfies the antisymmetry
property (2.1), we also have the relation

\[ \mathbb{E}[K(Z, Z')\, F(X)] = \mathbb{E}[K(Z', Z)\, F(X')] = -\mathbb{E}[K(Z, Z')\, F(X')]. \]

Average the two preceding displays to reach the identity (B.6). □
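The identity (B.6) can be verified by exact enumeration in a scalar example. In the sketch below, which is our own illustration, $X = \sum_i a_i \varepsilon_i$ is a Rademacher sum, $Z'$ is obtained from $Z = (\varepsilon_1, \dots, \varepsilon_n)$ by resampling one uniformly chosen coordinate, and the kernel is assumed to be $K(Z, Z') = n (X - X')$, which is antisymmetric and satisfies $\mathbb{E}[K(Z, Z') \mid Z] = X$.

```python
import math
from itertools import product

def stein_identity_sides(a, theta):
    """Return both sides of (B.6) for X = sum_i a_i * eps_i and F(x) = exp(theta * x)."""
    n = len(a)
    F = lambda x: math.exp(theta * x)
    signs = list(product((-1, 1), repeat=n))
    lhs = 0.0
    rhs = 0.0
    for eps in signs:
        x = sum(ai * ei for ai, ei in zip(a, eps))
        lhs += x * F(x) / len(signs)
        # Z' resamples one uniformly chosen coordinate with a fresh sign.
        for j in range(n):
            for new_sign in (-1, 1):
                x_prime = x + a[j] * (new_sign - eps[j])
                kernel = n * (x - x_prime)              # assumed kernel K(Z, Z')
                weight = 1.0 / (len(signs) * n * 2)     # probability of this outcome
                rhs += 0.5 * kernel * (F(x) - F(x_prime)) * weight
    return lhs, rhs

lhs, rhs = stein_identity_sides(a=[1.0, -2.0, 0.5], theta=0.3)
```

The two sides agree up to floating-point rounding, which reflects the fact that (B.6) is an exact identity rather than an estimate.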

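Under suitable regularity conditions, the derivative of the trace mgf of a random matrix $X$
has precisely the form needed to invoke the method of exchangeable pairs:

\[ m'(\theta) = \mathbb{E}\, \overline{\operatorname{tr}}\, [X e^{\theta X}]. \]

Hence, we may apply Lemma B.3 with $F(X) = e^{\theta X}$ to obtain the expression

\[ m'(\theta) = \frac{1}{2}\, \mathbb{E}\, \overline{\operatorname{tr}}\, \bigl[ K(Z, Z') (e^{\theta X} - e^{\theta X'}) \bigr]. \tag{B.7} \]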

The primary novelty in this work is a method for bounding the right-hand side of (B.7). 

B.3. The Exponential Mean Value Trace Inequality 

To control the expression (B.7) for the derivative of the trace mgf, we will invoke Lemma 3.2, 
the exponential mean value trace inequality. We establish this key lemma in this section. See 
the manuscript [20] for an alternative proof. 
Proof of Lemma 3.2. To begin, we develop an alternative expression for the trace quantity that
we need to bound. Observe that

\[ \frac{d}{d\tau}\, e^{\tau A} e^{(1-\tau)B} = e^{\tau A} (A - B)\, e^{(1-\tau)B}. \]

The Fundamental Theorem of Calculus delivers the identity

\[ e^A - e^B = \int_0^1 \frac{d}{d\tau} \bigl[ e^{\tau A} e^{(1-\tau)B} \bigr] \, d\tau = \int_0^1 e^{\tau A} (A - B)\, e^{(1-\tau)B} \, d\tau. \]

Therefore, using the definition of the trace inner product, we reach

\[ \operatorname{tr}[C(e^A - e^B)] = \int_0^1 \bigl\langle C,\ e^{\tau A} (A - B)\, e^{(1-\tau)B} \bigr\rangle \, d\tau. \tag{B.8} \]



We can bound the right-hand side by developing an appropriate matrix version of the inequality 
between the logarithmic mean and the arithmetic mean. 

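Let us define two families of positive-definite operators on the Hilbert space $\mathbb{M}^d$:

\[ \mathcal{A}_\tau(M) = e^{\tau A} M \quad \text{and} \quad \mathcal{B}_{1-\tau}(M) = M e^{(1-\tau)B} \quad \text{for each } \tau \in [0, 1]. \]

In other words, $\mathcal{A}_\tau$ is a left-multiplication operator, and $\mathcal{B}_{1-\tau}$ is a right-multiplication operator.
It follows immediately that $\mathcal{A}_\tau$ and $\mathcal{B}_{1-\tau}$ commute for each $\tau \in [0, 1]$. Young's inequality for
commuting operators, Lemma A.1, implies that

\[ \mathcal{A}_\tau \mathcal{B}_{1-\tau} \preccurlyeq \tau \cdot |\mathcal{A}_\tau|^{1/\tau} + (1 - \tau) \cdot |\mathcal{B}_{1-\tau}|^{1/(1-\tau)} = \tau \cdot |\mathcal{A}_1| + (1 - \tau) \cdot |\mathcal{B}_1|. \]

Integrating over $\tau$, we discover that

\[ \int_0^1 \mathcal{A}_\tau \mathcal{B}_{1-\tau} \, d\tau \preccurlyeq \frac{1}{2} \bigl( |\mathcal{A}_1| + |\mathcal{B}_1| \bigr) = \frac{1}{2} (\mathcal{A}_1 + \mathcal{B}_1). \tag{B.9} \]

This is our matrix extension of the logarithmic-arithmetic mean inequality.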

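To relate this result to the problem at hand, we rewrite the expression (B.8) using the
operators $\mathcal{A}_\tau$ and $\mathcal{B}_{1-\tau}$. Indeed,

\[
\begin{aligned}
\operatorname{tr}[C(e^A - e^B)] &= \int_0^1 \bigl\langle C,\ (\mathcal{A}_\tau \mathcal{B}_{1-\tau})(A - B) \bigr\rangle \, d\tau \\
&\le \Bigl[ \int_0^1 \bigl\langle C,\ (\mathcal{A}_\tau \mathcal{B}_{1-\tau})(C) \bigr\rangle \, d\tau \cdot \int_0^1 \bigl\langle A - B,\ (\mathcal{A}_\tau \mathcal{B}_{1-\tau})(A - B) \bigr\rangle \, d\tau \Bigr]^{1/2}.
\end{aligned}
\tag{B.10}
\]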



The second identity follows from the definition of the trace inner product. The last relation 
follows from the operator Cauchy-Schwarz inequality, Lemma A. 2, and the usual Cauchy- 
Schwarz inequality for the integral. 

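It remains to bound the two integrals in (B.10). These estimates are an immediate
consequence of (B.9). First,

\[
\begin{aligned}
\int_0^1 \bigl\langle C,\ (\mathcal{A}_\tau \mathcal{B}_{1-\tau})(C) \bigr\rangle \, d\tau &\le \frac{1}{2} \bigl\langle C,\ (\mathcal{A}_1 + \mathcal{B}_1)(C) \bigr\rangle \\
&= \frac{1}{2} \bigl\langle C,\ e^A C + C e^B \bigr\rangle = \frac{1}{2} \operatorname{tr}\bigl[ C^2 (e^A + e^B) \bigr].
\end{aligned}
\tag{B.11}
\]

The last two relations follow from the definitions of the operators $\mathcal{A}_1$ and $\mathcal{B}_1$, the definition of
the trace inner product, and the cyclicity of the trace. Likewise,

\[ \int_0^1 \bigl\langle A - B,\ (\mathcal{A}_\tau \mathcal{B}_{1-\tau})(A - B) \bigr\rangle \, d\tau \le \frac{1}{2} \operatorname{tr}\bigl[ (A - B)^2 (e^A + e^B) \bigr]. \tag{B.12} \]

Substitute (B.11) and (B.12) into the inequality (B.10) to reach

\[ \operatorname{tr}[C(e^A - e^B)] \le \frac{1}{2} \Bigl( \operatorname{tr}\bigl[ C^2 (e^A + e^B) \bigr] \cdot \operatorname{tr}\bigl[ (A - B)^2 (e^A + e^B) \bigr] \Bigr)^{1/2}. \]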

We obtain the result stated in Lemma 3.2 by applying the numerical inequality between the 
geometric mean and the arithmetic mean. □ 
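The two bounds that close the proof can be checked numerically on random Hermitian matrices. The sketch below assumes NumPy and assumes that Lemma 3.2 reads $\operatorname{tr}[C(e^A - e^B)] \le \tfrac{1}{4} \operatorname{tr}[(s(A-B)^2 + s^{-1}C^2)(e^A + e^B)]$, which is the form produced by the geometric-arithmetic mean step; the helper names and the random inputs are hypothetical.

```python
import numpy as np

def expm_hermitian(A):
    """Matrix exponential of a Hermitian matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.exp(w)) @ V.conj().T

def random_hermitian(rng, d):
    G = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    return (G + G.conj().T) / 2

def mean_value_sides(A, B, C, s):
    """Return the trace quantity, the geometric-mean bound, and the arithmetic-mean bound."""
    eA, eB = expm_hermitian(A), expm_hermitian(B)
    E = eA + eB
    lhs = np.trace(C @ (eA - eB)).real
    tC = np.trace(C @ C @ E).real                   # tr[C^2 (e^A + e^B)]
    tD = np.trace((A - B) @ (A - B) @ E).real       # tr[(A - B)^2 (e^A + e^B)]
    geometric = 0.5 * np.sqrt(tC * tD)
    arithmetic = 0.25 * (s * tD + tC / s)
    return lhs, geometric, arithmetic

rng = np.random.default_rng(1)
A, B, C = (random_hermitian(rng, 5) for _ in range(3))
lhs, geometric, arithmetic = mean_value_sides(A, B, C, s=0.7)
```

The ordering `lhs <= geometric <= arithmetic` holds for every choice of $s > 0$; the arithmetic bound is tightest when $s$ balances the two trace terms.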



B.4. Bounding the Derivative of the Trace Mgf

We are now prepared to obtain a bound for the derivative of the trace mgf in terms of the
conditional variance and the kernel conditional variance.

Lemma B.4 (The Derivative of the Trace Mgf). Suppose that $(X, X')$ is a $K$-Stein pair, and
assume that $X$ is almost surely bounded in norm. Define the normalized trace mgf $m(\theta) :=
\mathbb{E}\, \overline{\operatorname{tr}}\, e^{\theta X}$. Then

\[ |m'(\theta)| \le \frac{1}{2} |\theta| \cdot \inf_{s > 0} \mathbb{E}\, \overline{\operatorname{tr}}\, \bigl[ (s V_X + s^{-1} V^K)\, e^{\theta X} \bigr] \quad \text{for all } \theta \in \mathbb{R}. \tag{B.13} \]

The conditional variances $V_X$ and $V^K$ are defined in (2.6) and (2.7).
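Proof. Consider the derivative of the trace mgf

\[ m'(\theta) = \mathbb{E}\, \overline{\operatorname{tr}}\, \Bigl[ \frac{d}{d\theta}\, e^{\theta X} \Bigr] = \mathbb{E}\, \overline{\operatorname{tr}}\, [X e^{\theta X}], \tag{B.14} \]

where the dominated convergence theorem and the boundedness of $X$ justify the exchange
of expectation and derivative. When $\theta = 0$, we have $m'(\theta) = 0$, as advertised. When $\theta \ne 0$,
the form of this derivative is ripe for an application of the method of exchangeable pairs,
Lemma B.3. Since $X$ is bounded, the regularity condition (B.5) is satisfied, and we obtain

\[ m'(\theta) = \frac{1}{2}\, \mathbb{E}\, \overline{\operatorname{tr}}\, \bigl[ K(Z, Z') (e^{\theta X} - e^{\theta X'}) \bigr]. \tag{B.15} \]

The exponential mean value trace inequality, Lemma 3.2, implies that

\[
\begin{aligned}
|m'(\theta)| &\le \frac{1}{8} \cdot \inf_{s > 0} \mathbb{E}\, \overline{\operatorname{tr}}\, \bigl[ \bigl( s (\theta X - \theta X')^2 + s^{-1} K(Z, Z')^2 \bigr) \cdot (e^{\theta X} + e^{\theta X'}) \bigr] \\
&= \frac{1}{4} \cdot \inf_{s > 0} \mathbb{E}\, \overline{\operatorname{tr}}\, \bigl[ \bigl( s (\theta X - \theta X')^2 + s^{-1} K(Z, Z')^2 \bigr) \cdot e^{\theta X} \bigr] \\
&= \frac{1}{4} |\theta| \cdot \inf_{t > 0} \mathbb{E}\, \overline{\operatorname{tr}}\, \bigl[ \bigl( t (X - X')^2 + t^{-1} K(Z, Z')^2 \bigr) \cdot e^{\theta X} \bigr] \\
&= \frac{1}{2} |\theta| \cdot \inf_{t > 0} \mathbb{E}\, \overline{\operatorname{tr}}\, \Bigl[ \frac{t}{2}\, \mathbb{E}\bigl[ (X - X')^2 \mid Z \bigr] \cdot e^{\theta X} + \frac{t^{-1}}{2}\, \mathbb{E}\bigl[ K(Z, Z')^2 \mid Z \bigr] \cdot e^{\theta X} \Bigr].
\end{aligned}
\]

The first equality follows from the exchangeability of $(X, X')$; the second follows from the
change of variables $s = |\theta|^{-1} t$; and the final one depends on the pull-through property of
conditional expectation. We reach the result (B.13) by introducing the definitions (2.6) and
(2.7) of the conditional variance and the kernel conditional variance. □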
B.5. Bounding the Trace Mgf

Lemma B.4 gives us a powerful tool for bounding the trace mgf of a random matrix $X$ that is
presented as part of a kernel Stein pair. The following lemma shows how to derive a trace mgf
bound from bounds on the conditional variance and the kernel conditional variance.

Lemma B.5 (Trace Mgf Estimates for Bounded Random Matrices). Let $(X, X')$ be a $K$-Stein
pair, and suppose there exist nonnegative constants $c, v, s$ for which

\[ V_X \preccurlyeq s^{-1} (cX + vI) \quad \text{and} \quad V^K \preccurlyeq s\, (cX + vI) \quad \text{almost surely.} \tag{B.16} \]

Then the normalized trace mgf $m(\theta) := \mathbb{E}\, \overline{\operatorname{tr}}\, e^{\theta X}$ satisfies the bounds

\[ \log m(\theta) \le \frac{v \theta^2}{2} \quad \text{when } \theta \le 0, \tag{B.17} \]

\[ \log m(\theta) \le \frac{v}{c^2} \Bigl[ \log\Bigl( \frac{1}{1 - c\theta} \Bigr) - c\theta \Bigr] \tag{B.18} \]

\[ \phantom{\log m(\theta)} \le \frac{v \theta^2}{2 (1 - c\theta)} \quad \text{when } 0 \le \theta < 1/c. \tag{B.19} \]

The two conditional variances are defined in (2.6) and (2.7).

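Proof. As demonstrated in [13, Lem. 4.3], the assumption (B.16) implies that $X$ is almost
surely bounded in norm. Hence, we may apply Lemma B.4 along with our conditional variance
bounds (B.16) to obtain

\[
\begin{aligned}
|m'(\theta)| &\le \frac{1}{2} |\theta| \cdot \inf_{t > 0} \mathbb{E}\, \overline{\operatorname{tr}}\, \bigl[ (t V_X + t^{-1} V^K)\, e^{\theta X} \bigr] \\
&\le \frac{1}{2} |\theta| \cdot \mathbb{E}\, \overline{\operatorname{tr}}\, \bigl[ (s V_X + s^{-1} V^K)\, e^{\theta X} \bigr] \\
&\le |\theta| \cdot \mathbb{E}\, \overline{\operatorname{tr}}\, \bigl[ (cX + vI)\, e^{\theta X} \bigr] \\
&= c |\theta| \cdot \mathbb{E}\, \overline{\operatorname{tr}}\, [X e^{\theta X}] + v |\theta| \cdot \mathbb{E}\, \overline{\operatorname{tr}}\, e^{\theta X} \\
&= c |\theta| \cdot m'(\theta) + v |\theta| \cdot m(\theta),
\end{aligned}
\]

where the third inequality follows from the bounds (B.16) and the positivity of $e^{\theta X}$. The remainder
of the argument now proceeds as in [13, Lem. 4.3]. □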



B.6. Proof of Theorem 3.1 

The remainder of the proof of Theorem 3.1 is identical to that of [13, Thm. 4.1], once we 
substitute the trace mgf estimates from Lemma B.5 in place of the result [13, Lem. 4.3]. We 
omit the details. 



Appendix C: Proof of the Polynomial Moment Inequality 

Next, we develop a proof of the matrix Burkholder-Davis-Gundy inequality, Theorem 3.3. The 
proof parallels the argument in [13], but we need some new matrix inequalities to make the 
extension to kernel Stein pairs. 
C.1. The Polynomial Mean Value Trace Inequality

The critical new ingredient in Theorem 3.3 is the polynomial mean value trace inequality,
Lemma 3.4. Let us proceed with a proof of this result.

Proof of Lemma 3.4. First, we need to develop another representation for the trace quantity
that we are analyzing. Assume that $A, B, C \in \mathbb{H}^d$. A direct calculation shows that

\[ A^q - B^q = \sum_{k=0}^{q-1} A^k (A - B) B^{q-1-k}. \]

As a consequence,

\[ \operatorname{tr}[C(A^q - B^q)] = \sum_{k=0}^{q-1} \bigl\langle C,\ A^k (A - B) B^{q-1-k} \bigr\rangle. \tag{C.1} \]

To bound the right-hand side of (C.1), we require an appropriate mean inequality.
To that end, we define some self-adjoint operators on $\mathbb{M}^d$:

\[ \mathcal{A}_k(M) := A^k M \quad \text{and} \quad \mathcal{B}_k(M) := M B^k \quad \text{for each } k = 0, 1, 2, \dots, q - 1. \]

The absolute values of these operators satisfy

\[ |\mathcal{A}_k|(M) = |A|^k M \quad \text{and} \quad |\mathcal{B}_k|(M) = M |B|^k \quad \text{for each } k = 0, 1, 2, \dots, q - 1. \]

Note that $|\mathcal{A}_k|$ and $|\mathcal{B}_{q-k-1}|$ commute with each other for each $k$. Therefore, Young's inequality
for commuting operators, Lemma A.1, yields the bound

\[
\begin{aligned}
|\mathcal{A}_k \mathcal{B}_{q-k-1}| = |\mathcal{A}_k|\, |\mathcal{B}_{q-k-1}| &\preccurlyeq \frac{k}{q-1}\, |\mathcal{A}_k|^{(q-1)/k} + \frac{q-k-1}{q-1}\, |\mathcal{B}_{q-k-1}|^{(q-1)/(q-k-1)} \\
&= \frac{k}{q-1}\, |\mathcal{A}_1|^{q-1} + \frac{q-k-1}{q-1}\, |\mathcal{B}_1|^{q-1}.
\end{aligned}
\tag{C.2}
\]

Summing over $k$, we discover that

\[ \sum_{k=0}^{q-1} |\mathcal{A}_k \mathcal{B}_{q-k-1}| \preccurlyeq \frac{q}{2} \bigl( |\mathcal{A}_1|^{q-1} + |\mathcal{B}_1|^{q-1} \bigr). \tag{C.3} \]

This is the mean inequality that we require.

To apply this result, we need to rewrite (C.1) using the operators $\mathcal{A}_k$ and $\mathcal{B}_{q-k-1}$. It holds
that

\[
\begin{aligned}
\operatorname{tr}[C(A^q - B^q)] &= \sum_{k=0}^{q-1} \bigl\langle C,\ (\mathcal{A}_k \mathcal{B}_{q-k-1})(A - B) \bigr\rangle \\
&\le \Bigl[ \sum_{k=0}^{q-1} \bigl\langle C,\ |\mathcal{A}_k \mathcal{B}_{q-k-1}|(C) \bigr\rangle \cdot \sum_{k=0}^{q-1} \bigl\langle A - B,\ |\mathcal{A}_k \mathcal{B}_{q-k-1}|(A - B) \bigr\rangle \Bigr]^{1/2}.
\end{aligned}
\tag{C.4}
\]

The second relation follows from the operator Cauchy-Schwarz inequality, Lemma A.2, and the
usual Cauchy-Schwarz inequality for the sum.
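It remains to bound the two sums on the right-hand side of (C.4). The mean inequality (C.3)
ensures that

\[
\begin{aligned}
\sum_{k=0}^{q-1} \bigl\langle C,\ |\mathcal{A}_k \mathcal{B}_{q-k-1}|(C) \bigr\rangle &\le \frac{q}{2} \bigl\langle C,\ \bigl( |\mathcal{A}_1|^{q-1} + |\mathcal{B}_1|^{q-1} \bigr)(C) \bigr\rangle \\
&= \frac{q}{2} \bigl\langle C,\ |A|^{q-1} C + C |B|^{q-1} \bigr\rangle = \frac{q}{2} \operatorname{tr}\bigl[ C^2 \bigl( |A|^{q-1} + |B|^{q-1} \bigr) \bigr].
\end{aligned}
\tag{C.5}
\]

Likewise,

\[ \sum_{k=0}^{q-1} \bigl\langle A - B,\ |\mathcal{A}_k \mathcal{B}_{q-k-1}|(A - B) \bigr\rangle \le \frac{q}{2} \operatorname{tr}\bigl[ (A - B)^2 \bigl( |A|^{q-1} + |B|^{q-1} \bigr) \bigr]. \tag{C.6} \]

Introduce the two inequalities (C.5) and (C.6) into (C.4) to reach

\[ \operatorname{tr}[C(A^q - B^q)] \le \frac{q}{2} \Bigl( \operatorname{tr}\bigl[ C^2 \bigl( |A|^{q-1} + |B|^{q-1} \bigr) \bigr] \cdot \operatorname{tr}\bigl[ (A - B)^2 \bigl( |A|^{q-1} + |B|^{q-1} \bigr) \bigr] \Bigr)^{1/2}. \]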

The result follows when we apply the numerical inequality between the geometric mean and 
the arithmetic mean. □ 
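The telescoping identity $A^q - B^q = \sum_{k=0}^{q-1} A^k (A - B) B^{q-1-k}$ that opens the proof holds for arbitrary square matrices, commuting or not. The short sketch below confirms it in exact integer arithmetic; the helper functions and the two test matrices are our own illustrative choices.

```python
def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matadd(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def matsub(X, Y):
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def matpow(X, k):
    n = len(X)
    result = [[int(i == j) for j in range(n)] for i in range(n)]   # identity matrix
    for _ in range(k):
        result = matmul(result, X)
    return result

def telescoping_sum(A, B, q):
    """Return sum_{k=0}^{q-1} A^k (A - B) B^{q-1-k}."""
    n = len(A)
    total = [[0] * n for _ in range(n)]
    diff = matsub(A, B)
    for k in range(q):
        term = matmul(matmul(matpow(A, k), diff), matpow(B, q - 1 - k))
        total = matadd(total, term)
    return total

A = [[2, 1], [1, -1]]   # symmetric integer matrices keep the arithmetic exact
B = [[0, 3], [3, 1]]
q = 5
left = matsub(matpow(A, q), matpow(B, q))
right = telescoping_sum(A, B, q)
```

Here `left` and `right` agree entry by entry, even though `A` and `B` do not commute.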

C.2. Proof of Theorem 3.3 

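Abbreviate

\[ E := \mathbb{E}\, \|X\|_{2p}^{2p} = \mathbb{E} \operatorname{tr} |X|^{2p} = \mathbb{E} \operatorname{tr}\bigl[ X \cdot X^{2p-1} \bigr]. \]

To apply the method of exchangeable pairs, Lemma B.3, we check the regularity condition (B.5):

\[
\begin{aligned}
\mathbb{E}\, \bigl\| K(Z, Z') \cdot X^{2p-1} \bigr\| &\le \mathbb{E}\bigl( \|K(Z, Z')\| \, \|X\|^{2p-1} \bigr) \\
&\le \bigl( \mathbb{E}\, \|K(Z, Z')\|^{2p} \bigr)^{1/(2p)} \bigl( \mathbb{E}\, \|X\|^{2p} \bigr)^{(2p-1)/(2p)} < \infty,
\end{aligned}
\]

where we have applied Hölder's inequality for expectation and the fact that the spectral norm
is dominated by the Schatten $(2p)$-norm. Invoke Lemma B.3 with $F(X) = X^{2p-1}$ to reach

\[ E = \frac{1}{2}\, \mathbb{E} \operatorname{tr}\bigl[ K(Z, Z') \cdot \bigl( X^{2p-1} - X'^{\,2p-1} \bigr) \bigr]. \]

Next, fix a parameter $s > 0$. Apply the polynomial mean value trace inequality, Lemma 3.4,
with $q = 2p - 1$ to obtain the estimate

\[
\begin{aligned}
E &\le \frac{2p-1}{8}\, \mathbb{E} \operatorname{tr}\bigl[ \bigl( s (X - X')^2 + s^{-1} K(Z, Z')^2 \bigr) \cdot \bigl( X^{2p-2} + X'^{\,2p-2} \bigr) \bigr] \\
&= \frac{2p-1}{4}\, \mathbb{E} \operatorname{tr}\bigl[ \bigl( s (X - X')^2 + s^{-1} K(Z, Z')^2 \bigr) \cdot X^{2p-2} \bigr] \\
&= \frac{2p-1}{2}\, \mathbb{E} \operatorname{tr}\bigl[ \bigl( s V_X + s^{-1} V^K \bigr) \cdot X^{2p-2} \bigr],
\end{aligned}
\]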

where we have used the exchangeability of $(X, X')$ and the definitions (2.6) and (2.7) of the
conditional variances. In the last step, we justify the pull-through property with the regularity
condition $\mathbb{E}\, \|X\|_{2p}^{2p} < \infty$. The remainder of the argument is identical with the proof of [13,
Thm. 8.1].
Appendix D: Haar Measures and Controlled Total Variation

In this section, we prove Corollary 7.1 by studying the behavior of Hermitian functions of
group-valued random elements. Under the notation of Corollary 7.1, we define $X := \Psi(Z)$. The
result [4, Thm. 4.6] shows that scalar functions of the Haar measure are well concentrated whenever
particular random walks on $G$ converge rapidly to the Haar distribution. In the sections to
follow, we develop a Hermitian analogue of this relationship using the tools of Sections 2 and 3.
As in [4], we will adopt the total variation distance between measures (5.1) as our convergence
metric.

D.l. A Kernel Coupling 

We begin by establishing a kernel coupling suitable for analyzing $X$. Since $Y \in G$ is independent
of $Z$ and satisfies (7.1), $Z' = YZ$ is an exchangeable counterpart for $Z$, and hence $(X, X') =
(\Psi(Z), \Psi(Z'))$ is an exchangeable pair.
Moreover, the sequence of pairs

\[ \bigl( Z_{(i)}, Z'_{(i)} \bigr) := \bigl( Y_i \cdots Y_1 Z,\ Y_i \cdots Y_1 Z' \bigr) \quad \text{for each } i \ge 0 \tag{D.1} \]

defines a kernel coupling for $(X, X')$. Thus, $(X, X')$ is a kernel Stein pair with $K$ defined as
in Lemma 2.4 whenever the precondition (2.4) is met.

D.2. The Conditional Variances

The sequence of multipliers $(Y_i \cdots Y_1)_{i=1}^{\infty}$ in our kernel coupling (D.1) can be viewed as a random
walk on the group $G$, and, for many choices of $Y$, this sequence will converge to a Haar
distributed random variable. Intuitively, a faster rate of convergence implies a faster coupling time
for the Markov chains $(Z_{(i)})_{i \ge 0}$ and $(Z'_{(i)})_{i \ge 0}$ and hence a smaller $K$-conditional variance (2.7).
Our next lemma makes this intuition more precise by bounding the $K$-conditional variance in
terms of the total variation distance between $Y_i \cdots Y_1$ and $Z$.

Lemma D.1. Let $Z \sim \mu$ be Haar distributed on a group $G$. Let $(X, X') := (\Psi(Z), \Psi(Z'))$
with $K$ constructed as in Section D.1. Suppose that $\mu_i$ is the distribution of $Y_i \cdots Y_1$ and that

\[ S^2 \ge \sup_{g \in G} \bigl\| \mathbb{E}\bigl[ (\Psi(g) - \Psi(Yg))^2 \bigr] \bigr\|. \]

Then $(X, X')$ is a $K$-Stein pair whenever $\sum_{i=0}^{\infty} d_{\mathrm{TV}}(\mu_i, \mu) < \infty$. Moreover, the conditional
variance (2.6) satisfies

\[ \lambda_{\max}(V_X) \le \frac{S^2}{2} \quad \text{almost surely,} \]

and the $K$-conditional variance (2.7) satisfies

\[ \lambda_{\max}(V^K) \le \frac{1}{2} \Bigl( \sum_{i=0}^{\infty} \min\bigl\{ S,\ 4R \, d_{\mathrm{TV}}(\mu_i, \mu) \bigr\} \Bigr)^2 \quad \text{almost surely.} \]
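Proof. For each $i \ge 0$ and all $z, z' \in G$,

\[
\begin{aligned}
\bigl( \mathbb{E}\bigl[ \Psi(Z_{(i)}) - \Psi(Z'_{(i)}) \bigm| Z = z, Z' = z' \bigr] \bigr)^2 &= \bigl( \mathbb{E}[\Psi(Y_i \cdots Y_1 z)] - \mathbb{E}[\Psi(Y_i \cdots Y_1 z')] \bigr)^2 \\
&\preccurlyeq 2\, \mathbb{E}[\Psi(Y_i \cdots Y_1 z)]^2 + 2\, \mathbb{E}[\Psi(Y_i \cdots Y_1 z')]^2,
\end{aligned}
\]

where the inequality follows from the convexity of the matrix square. For any $z \in G$,

\[
\begin{aligned}
\mathbb{E}[\Psi(Y_i \cdots Y_1 z)] &= \mathbb{E}[\Psi(Y_i \cdots Y_1 z)] - \mathbb{E}[\Psi(Z)] \\
&= \mathbb{E}[\Psi(Y_i \cdots Y_1 z)] - \mathbb{E}[\Psi(Zz)],
\end{aligned}
\]

since $\mathbb{E}\, \Psi(Z) = 0$ and $Z$ is Haar distributed, and hence $Zz =_d Z$. Furthermore, for any positive
measure $\nu$ that dominates $\mu$ and $\mu_i$,

\[
\begin{aligned}
\bigl\| \mathbb{E}[\Psi(Y_i \cdots Y_1 z)] \bigr\| &= \Bigl\| \int \Psi(yz) \Bigl( \frac{d\mu_i}{d\nu}(y) - \frac{d\mu}{d\nu}(y) \Bigr) \, d\nu(y) \Bigr\| \\
&\le R \int \Bigl| \frac{d\mu_i}{d\nu}(y) - \frac{d\mu}{d\nu}(y) \Bigr| \, d\nu(y) \le 2R \, d_{\mathrm{TV}}(\mu_i, \mu),
\end{aligned}
\]

by our bound on $\Psi$ and the definition of total variation. Therefore,

\[ \mathbb{E}\bigl[ \bigl( \mathbb{E}\bigl[ \Psi(Z_{(i)}) - \Psi(Z'_{(i)}) \bigm| Z, Z' \bigr] \bigr)^2 \bigm| Z \bigr] \preccurlyeq 16 R^2 \, d_{\mathrm{TV}}^2(\mu_i, \mu) \, I. \]

We note moreover that

\[ \sum_{i=0}^{\infty} \bigl\| \mathbb{E}\bigl[ \Psi(Z_{(i)}) - \Psi(Z'_{(i)}) \bigm| Z_{(0)} = z, Z'_{(0)} = z' \bigr] \bigr\| \le \sum_{i=0}^{\infty} 4R \, d_{\mathrm{TV}}(\mu_i, \mu) \]

for all $z$ and $z'$. Hence, by Lemma 2.4, $(X, X')$ is a valid $K$-Stein pair whenever the total
variation distances are summable.

Next, let $W_i := Y_i \cdots Y_1$, and notice that

\[
\begin{aligned}
\bigl\| \mathbb{E}\bigl[ \bigl( \mathbb{E}\bigl[ \Psi(Z_{(i)}) - \Psi(Z'_{(i)}) \bigm| Z, Z' \bigr] \bigr)^2 \bigm| Z = z \bigr] \bigr\| &\le \bigl\| \mathbb{E}\bigl[ \bigl( \Psi(Z_{(i)}) - \Psi(Z'_{(i)}) \bigr)^2 \bigm| Z = z \bigr] \bigr\| \\
&= \bigl\| \mathbb{E}\bigl[ (\Psi(W_i z) - \Psi(W_i Y z))^2 \bigr] \bigr\| \\
&\le \sup_{g \in G} \bigl\| \mathbb{E}\bigl[ (\Psi(gz) - \Psi(gYz))^2 \bigr] \bigr\| \\
&= \sup_{g \in G} \bigl\| \mathbb{E}\bigl[ (\Psi(g) - \Psi(Yg))^2 \bigr] \bigr\| \\
&\le S^2,
\end{aligned}
\]

where the first inequality is a consequence of the convexity of the matrix square, and the final
equality follows from the property $g^{-1} Y g =_d Y$ for all $g \in G$. Hence, we may apply Lemma 2.6
with $s = S$, with $s_i = \min\{ S,\ 4R \, d_{\mathrm{TV}}(\mu_i, \mu) \}$ for $i \ge 0$, and with $\Gamma(Z) = I$ to obtain the
result. □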



D.3. Exponential Concentration 

We are finally equipped to prove Corollary 7.1. Under the kernel coupling construction of
Section D.1, the conditional variance bounds of Lemma D.1 imply that Theorem 3.1 holds with $c = 0$,
with $v = \tfrac{1}{2} S^2 \sum_{i=0}^{\infty} \min\{ 1,\ 4RS^{-1} d_{\mathrm{TV}}(\mu_i, \mu) \}$, and with $s = \sum_{i=0}^{\infty} \min\{ 1,\ 4RS^{-1} d_{\mathrm{TV}}(\mu_i, \mu) \}$.

This establishes the result.